AI Voice Cloning

The cluster focuses on AI models for voice cloning and speech synthesis from short audio clips, discussing output quality, training requirements, comparisons to systems such as Tacotron2 and WaveNet, datasets such as LibriTTS, offerings from OpenAI, and techniques such as retrieval-based voice conversion.

Trend: ➡️ Stable 0.6x
Category: AI & Machine Learning
Comments: 4,174
Years Active: 20
Top Authors: 5
Topic ID: #6459

Activity Over Time

2007: 1 · 2008: 5 · 2009: 17 · 2010: 27 · 2011: 31 · 2012: 35 · 2013: 22
2014: 61 · 2015: 87 · 2016: 226 · 2017: 333 · 2018: 242 · 2019: 250 · 2020: 347
2021: 216 · 2022: 315 · 2023: 659 · 2024: 676 · 2025: 555 · 2026: 69

Keywords

HMM, e.g., lyrebird.ai, replicastudios.com, TensorFlow, youtu.be, TTS, STT, PR, AI, speech, audio, voice, voices, model, models, trained, text, cloning, speaker

Sample Comments

badrequest Nov 13, 2019

Has anybody tried making an AI that generates 5 seconds of arbitrary speech to feed into this AI?

smusamashah Feb 6, 2024

The speech quality is very good. It's not clear from the page whether they are just adding styling to already generated audio or whether the audio is generated entirely by their own model.

cuuupid Sep 19, 2024

Wish there was training or finetuning code, as finetuning voices seems like a key requirement for any commercial use.

robviren Dec 26, 2025

I want to explore the space of audio encoding and GPT-like understanding of audio. I'm very interested in how a simple 1D signal must go through so much processing before language models can work with it, and I'm curious what trade-offs that introduces. It would also be fun to build a TTS library and understand it.
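As context for the processing that comment refers to, here is a minimal sketch of the usual first step: turning a raw 1D waveform into a log-mel spectrogram, the 2D frame sequence most neural speech models (and discrete audio tokenizers) consume. The file name, sample rate, and mel settings below are illustrative assumptions, not parameters taken from any specific model discussed in this cluster.

```python
import librosa
import numpy as np

# Illustrative settings; real systems pick their own (e.g. 16 kHz vs 22.05 kHz,
# 80 vs 128 mel bands). "speech.wav" is a hypothetical input file.
SAMPLE_RATE = 16_000
N_MELS = 80

# Load the raw 1D waveform.
waveform, sr = librosa.load("speech.wav", sr=SAMPLE_RATE, mono=True)

# Short-time Fourier transform + mel filterbank -> (n_mels, n_frames) matrix.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sr,
    n_fft=1024,
    hop_length=256,
    n_mels=N_MELS,
)

# Log compression, which is what "log-mel" features refer to.
log_mel = np.log(mel + 1e-6)

print(waveform.shape)  # e.g. (80000,) for 5 s of audio: a plain 1D signal
print(log_mel.shape)   # e.g. (80, 313): a sequence of frames a model can consume
```

Codec-style tokenizers go one step further and quantize such frames (or learned embeddings of them) into integer tokens, which is roughly how GPT-like models are made to operate on audio.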

donkarma Aug 30, 2022

OpenAI did this, but it doesn't sound great. I think it's because audio carries less information, so the brain is very picky.

ducktective May 18, 2022

WOW! I'm flabbergasted! Check out the `Compared to Tacotron2 (with the LJSpeech voice)` or the `Prompt Engineering` section!

paulkon Aug 14, 2024

Is there an open source speech-to-speech model which retains intonation, cadence and delivery?

reubenmorais Jan 3, 2022

Similarity depends on many factors: recording quality, which language you're synthesizing in (models trained on more speakers do better), and the diversity of prosody in your recording. Try recording for a bit longer and "acting out" a bit in your tone; that tends to give me interesting results :)
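One way to sanity-check the "diversity of prosody" point is to look at how much the pitch contour of the reference recording actually varies before using it for cloning. The sketch below uses librosa's pYIN pitch tracker; the file name and the use of the pitch coefficient of variation as a rough monotony proxy are illustrative assumptions, not a method any tool in this cluster prescribes.

```python
import librosa
import numpy as np

# "reference.wav" is a hypothetical reference recording for voice cloning.
waveform, sr = librosa.load("reference.wav", sr=16_000, mono=True)

# Estimate the fundamental frequency (pitch) contour with pYIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    waveform,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz
    fmax=librosa.note_to_hz("C7"),  # ~2093 Hz
    sr=sr,
)

# Keep only voiced frames and summarize how much the pitch moves around.
voiced_f0 = f0[voiced_flag]
duration = len(waveform) / sr
pitch_spread = np.std(voiced_f0) / np.mean(voiced_f0)  # coefficient of variation

print(f"duration: {duration:.1f}s")
print(f"pitch spread (std/mean): {pitch_spread:.2f}")
# A very small spread on a very short clip suggests flat, monotone speech;
# per the comment above, a longer and more expressive recording tends to clone better.
```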

rrecuero Nov 10, 2017

Take a look at Lyrebird as well https://lyrebird.ai/ ;)

The_Amp_Walrus Feb 28, 2019

I'm interested in using this for text-to-speech rather than speech-to-text. Is WaveNet still the state of the art for training on a dataset like this?