AI Voice Cloning
The cluster focuses on AI models for voice cloning and speech synthesis from short audio clips, discussing output quality, training requirements, comparisons to systems such as Tacotron2 and WaveNet, datasets such as LibriTTS, and OpenAI's offerings, as well as techniques such as retrieval-based voice conversion.
[Dashboard panels: Activity Over Time, Top Contributors, Keywords (interactive charts not captured in this text export)]
Sample Comments
Has anybody tried making an AI that generates 5 seconds of arbitrary speech to feed into this AI?
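A minimal sketch of that bootstrap idea, assuming the Coqui TTS library; the library, model names, and file paths are my assumptions for illustration, not the system discussed in the thread. One model synthesizes a short seed clip, and a cloning model then uses that clip as its speaker reference.

```python
# Hypothetical two-stage pipeline: a single-speaker TTS produces a ~5 s
# seed clip, which a voice-cloning model then consumes as its speaker
# prompt. Library and model names are assumptions, not the thread's system.
from TTS.api import TTS

# Stage 1: synthesize a few seconds of arbitrary speech.
base = TTS("tts_models/en/ljspeech/tacotron2-DDC")
base.tts_to_file(text="This sentence exists only to seed the cloning model.",
                 file_path="seed.wav")

# Stage 2: clone the synthetic voice using the seed clip as reference.
cloner = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
cloner.tts_to_file(text="A clone of a voice that never existed.",
                   speaker_wav="seed.wav",
                   language="en",
                   file_path="cloned.wav")
```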
The speech quality is very good. It's not clear from the page whether they're just adding styling to already-generated audio, or whether the audio is generated entirely by their own model.
Wish there were training or fine-tuning code, as fine-tuning voices seems like a key requirement for any commercial use.
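For readers wondering what such fine-tuning code typically looks like, here is a generic sketch, assuming a PyTorch TTS model with `text_encoder` and `decoder` submodules and a (text, mel-spectrogram) dataset. The project under discussion ships no such code, so every name below is invented for illustration.

```python
# Generic shape of a voice fine-tuning loop. All attribute and dataset
# names are hypothetical stand-ins, not a real project's API.
import torch
from torch.utils.data import DataLoader

def finetune(model, dataset, epochs=10, lr=1e-5):
    # Freeze the text encoder; adapt only the decoder to the new voice.
    for p in model.text_encoder.parameters():
        p.requires_grad = False

    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)

    model.train()
    for _ in range(epochs):
        for text, mel in loader:
            pred = model(text)                    # predicted mel-spectrogram
            loss = torch.nn.functional.l1_loss(pred, mel)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```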
I want to explore the space of audio encoding and GPT-like understanding of audio. I'm very interested in how a simple 1-D signal must go through so much processing before a language model can understand it, and I'm curious what trade-offs occur along the way. It would also be fun to build a TTS library and understand it.
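As one concrete route into that space, here is a sketch of the usual first step: turning the 1-D waveform into discrete tokens a GPT-style model can consume. It assumes Meta's EnCodec neural codec, which is my choice of tokenizer, not one named in the comment.

```python
# Sketch: waveform -> discrete codec tokens, using EnCodec (an assumption).
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)   # 8 codebooks at 24 kHz

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))   # expects [batch, channels, time]

# codes: [batch, n_codebooks, n_frames] of integer tokens; this is the
# kind of sequence a language model over audio would be trained on.
codes = torch.cat([code for code, _ in frames], dim=-1)
print(codes.shape, codes.dtype)
```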
OpenAI did this, but it doesn't sound great. I think it's because sound carries less information, so the brain is very picky about it.
WOW! I'm flabbergasted! Check out the `Compared to Tacotron2 (with the LJSpeech voice)` or `Prompt Engineering` sections!
Is there an open-source speech-to-speech model which retains intonation, cadence, and delivery?
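The retrieval-based conversion mentioned in the summary is one answer here: because it operates frame-by-frame on the source utterance, timing and prosody survive by construction. Below is a conceptual sketch of the kNN matching step, with `src_feats` and `tgt_pool` standing in for frame features from a self-supervised encoder such as HuBERT or WavLM; this is hypothetical plumbing, not a specific project's API.

```python
# Conceptual core of retrieval-based (kNN) voice conversion: keep the
# source's frame order and timing, but swap each frame's content
# features for its nearest neighbours from the target speaker.
import torch

def knn_convert(src_feats: torch.Tensor,   # [T_src, D] source frames
                tgt_pool: torch.Tensor,    # [T_tgt, D] target-speaker frames
                k: int = 4) -> torch.Tensor:
    # Cosine similarity between every source frame and the target pool.
    src = torch.nn.functional.normalize(src_feats, dim=-1)
    tgt = torch.nn.functional.normalize(tgt_pool, dim=-1)
    sims = src @ tgt.T                      # [T_src, T_tgt]
    _, idx = sims.topk(k, dim=-1)           # k nearest target frames

    # Average the neighbours. Source frame order (hence intonation,
    # cadence, delivery) is preserved; only the timbre is replaced.
    return tgt_pool[idx].mean(dim=1)        # [T_src, D]
```

A vocoder trained on the same feature space would then turn the converted frames back into audio.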
Similarity depends on many factors: recording quality, which language you're synthesizing in (models trained on more speakers do better), and the diversity of prosody in your recording. Try recording for a bit longer and "acting out" a bit with your tone; that tends to give me interesting results :)
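To make the "diversity of prosody" advice checkable, here is a rough diagnostic using librosa's pYIN pitch tracker; the tool is my choice, and what counts as "enough" variation is a judgment call.

```python
# Rough prosody check on a reference recording: a wide pitch range and
# large variance suggest the "acting out" recommended above.
import librosa
import numpy as np

y, sr = librosa.load("reference.wav", sr=None)
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"), sr=sr)
f0 = f0[voiced]                    # keep voiced frames only

print(f"pitch mean:  {np.nanmean(f0):6.1f} Hz")
print(f"pitch std:   {np.nanstd(f0):6.1f} Hz")
print(f"pitch range: {np.nanmax(f0) - np.nanmin(f0):6.1f} Hz")
# A nearly flat f0 track (tiny std and range) usually indicates a
# monotone reference, which tends to clone less convincingly.
```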
Take a look at Lyrebird as well: https://lyrebird.ai/ ;)
I'm interested in using this for text-to-speech rather than speech-to-text. Is WaveNet still the state of the art for training on a dataset like this?