AI Voice Cloning

The cluster focuses on AI models for voice cloning and speech synthesis from short audio clips, discussing output quality, training requirements, comparisons to systems such as Tacotron2 and WaveNet, datasets such as LibriTTS, offerings from OpenAI, and techniques such as retrieval-based voice conversion.

Trend: ➡️ Stable 0.6x
Category: AI & Machine Learning
Comments: 4,174
Years Active: 20
Top Authors: 5
Topic ID: #6459

Activity Over Time

2007: 1 · 2008: 5 · 2009: 17 · 2010: 27 · 2011: 31 · 2012: 35 · 2013: 22
2014: 61 · 2015: 87 · 2016: 226 · 2017: 333 · 2018: 242 · 2019: 250 · 2020: 347
2021: 216 · 2022: 315 · 2023: 659 · 2024: 676 · 2025: 555 · 2026: 69

Keywords

HMM, e.g., lyrebird.ai, replicastudios.com, TensorFlow, youtu.be, TTS, STT, PR, AI, speech, audio, voice, voices, model, models, trained, text, cloning, speaker

Sample Comments

badrequest Nov 13, 2019

Has anybody tried making an AI that generates 5 seconds of arbitrary speech to feed into this AI?

smusamashah Feb 6, 2024

The speech quality is very good. It's not clear from the page whether they are just adding styling to already generated audio or whether the audio is generated entirely by their own model.

cuuupid Sep 19, 2024

Wish there was training or finetuning code, as finetuning voices seems like a key requirement for any commercial use.

robviren Dec 26, 2025

I want to explore the space of audio encoding and GPT-like understanding of audio. I'm very interested in how a simple 1D signal must go through so much processing before language models can work with it, and I'm curious what trade-offs that introduces. It would also be fun to build a TTS library and understand it.
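As context for the processing that comment refers to, here is a minimal sketch of the usual first step: turning a raw 1D waveform into a log-mel spectrogram, the 2D frame sequence most neural speech models (and discrete audio tokenizers) consume. The file name, sample rate, and mel settings below are illustrative assumptions, not parameters taken from any specific model discussed in this cluster.

```python
import librosa
import numpy as np

# Illustrative settings; real systems pick their own (e.g. 16 kHz vs 22.05 kHz,
# 80 vs 128 mel bands). "speech.wav" is a hypothetical input file.
SAMPLE_RATE = 16_000
N_MELS = 80

# Load the raw 1D waveform.
waveform, sr = librosa.load("speech.wav", sr=SAMPLE_RATE, mono=True)

# Short-time Fourier transform + mel filterbank -> (n_mels, n_frames) matrix.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sr,
    n_fft=1024,
    hop_length=256,
    n_mels=N_MELS,
)

# Log compression, which is what "log-mel" features refer to.
log_mel = np.log(mel + 1e-6)

print(waveform.shape)  # e.g. (80000,) for 5 s of audio: a plain 1D signal
print(log_mel.shape)   # e.g. (80, 313): a sequence of frames a model can consume
```

Codec-style tokenizers go one step further and quantize such frames (or learned embeddings of them) into integer tokens, which is roughly how GPT-like models are made to operate on audio.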

donkarma Aug 30, 2022

OpenAI did this, but it doesn't sound great. I think it's because audio carries less information, so the brain is very picky.

ducktective May 18, 2022

WOW! I'm flabbergasted! Check out the `Compared to Tacotron2 (with the LJSpeech voice)` or the `Prompt Engineering` section!

paulkon Aug 14, 2024

Is there an open source speech-to-speech model which retains intonation, cadence and delivery?

reubenmorais Jan 3, 2022

Similarity depends on many factors: recording quality, which language you're synthesizing in (models trained on more speakers do better), and the diversity of prosody in your recording. Try recording for a bit longer and "acting out" a bit in your tone; that tends to give me interesting results :)
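One way to sanity-check the "diversity of prosody" point is to look at how much the pitch contour of the reference recording actually varies before using it for cloning. The sketch below uses librosa's pYIN pitch tracker; the file name and the use of the pitch coefficient of variation as a rough monotony proxy are illustrative assumptions, not a method any tool in this cluster prescribes.

```python
import librosa
import numpy as np

# "reference.wav" is a hypothetical reference recording for voice cloning.
waveform, sr = librosa.load("reference.wav", sr=16_000, mono=True)

# Estimate the fundamental frequency (pitch) contour with pYIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    waveform,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz
    fmax=librosa.note_to_hz("C7"),  # ~2093 Hz
    sr=sr,
)

# Keep only voiced frames and summarize how much the pitch moves around.
voiced_f0 = f0[voiced_flag]
duration = len(waveform) / sr
pitch_spread = np.std(voiced_f0) / np.mean(voiced_f0)  # coefficient of variation

print(f"duration: {duration:.1f}s")
print(f"pitch spread (std/mean): {pitch_spread:.2f}")
# A very small spread on a very short clip suggests flat, monotone speech;
# per the comment above, a longer and more expressive recording tends to clone better.
```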

rrecuero Nov 10, 2017

Take a look at Lyrebird as well https://lyrebird.ai/ ;)

The_Amp_Walrus Feb 28, 2019

I'm interested in using this for text-to-speech rather than speech-to-text. Is WaveNet still the state of the art for training on a dataset like this?