Hello everyone. I recently watched the new Mark II demo on YouTube, and the new TTS voice based on Alan Popey's voice has improved a lot. Where can I find it, and how could I implement it? Is it just a new voice pack?
Hi @takov751
I think you are referring to Mimic 3, which is not yet released. Maybe @synesthesiam could provide more details about the roadmap.
Boy, searching for “Mimic 3” on YouTube sure is complicated by some competing search terms.
Those demos look very cool.
How many hours of recordings are required for high-quality voices? I know Mimic 2 needed about 16 hours or 20,000 sentences. Has this been reduced at all?
What are some of the new features?
And of course, when can we expect to use it?
Building a model from scratch with Tacotron2 or VITS still requires several hours of audio, or thousands of recorded sentences.
Now there is YourTTS, which, starting from a pretrained model, requires only a minute of your voice: YourTTS: Zero-Shot Multi-Speaker Text Synthesis and Voice Conversion (Coqui blog)
I would be interested in a roadmap, and in testing it, more or less, because it would be a neat feature to use. I could even, let's say, create my own TTS voice from an audio source. I love your Docker images, btw. I used them a while back to create shorter audiobooks.
@sparkyvision I’ve had success with as little as 30 minutes of voice, but from a specially chosen set of phonetically-balanced sentences.
Some of the new features (relative to Mimic 2) include:
- Entirely local, runs with ~0.5 real-time factor on a Pi 4 with 64-bit OS
- Can be accelerated by a GPU
- Support for a subset of SSML (with custom pronunciations)
- Can contain hundreds of voices in a single model
- Currently supports English (U.S.), but I plan to train up all of the Larynx voices
- Runs on Linux and Windows
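For anyone unfamiliar with the term: a real-time factor (RTF) is usually defined as synthesis time divided by the length of the audio produced, so ~0.5 would mean roughly 5 seconds of compute for 10 seconds of speech. Here is a minimal sketch of that arithmetic, assuming that definition. The function names are purely illustrative and not part of Mimic 3.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate how long it takes to synthesize audio of a given length."""
    return audio_seconds * rtf


# Hypothetical example: a 10-second utterance at the ~0.5 RTF quoted above.
print(estimated_synthesis_seconds(10.0, 0.5))  # 5.0
# Going the other way: 5 seconds of compute for 10 seconds of audio.
print(real_time_factor(5.0, 10.0))  # 0.5
```

An RTF below 1.0 means speech is generated faster than it takes to play back.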
@Dominik For each language, I train an initial model from the largest dataset I have, and then use that to fine-tune other voices. I still need more data than YourTTS, of course.
@takov751 It’s getting close to something I would feel comfortable releasing. The training process is still a bit complex, unfortunately. I’m hoping to wrap it all up in a Docker image along with some pre-trained models so others can fine-tune their own voices.
And, of course, I’d like to release Mimic 3 Docker images and Debian installers to make it easier for folks to try it out!