New male British TTS voice

Hello everyone. I recently saw the new Mark II demo on YouTube, and the new TTS voice based on Alan Pope's voice has improved a lot. Where can I find it, and how can I implement it? Is it just a new voice pack?

Hi @takov751

I think you are referring to Mimic 3, which is not yet released. Maybe @synesthesiam could provide more details about the roadmap.

Boy, searching for “Mimic 3” on YouTube sure is complicated by some competing search terms.

Those demos look very cool.

@goldyfruit thanks for the quick reply. That’s probably what I am after.

Hi @takov751, happy to answer any questions about Mimic 3 :slight_smile:

How many hours of recordings are required for high-quality voices? I know Mimic 2 needed about 16 hours, or 20,000 sentences. Has this been reduced at all?

What are some of the new features?

And of course, when can we expect to use it? :slight_smile:

Building a model from scratch with Tacotron2 or VITS still requires several hours of audio, i.e. thousands of recorded sentences.

Now there is YourTTS, which, starting from a pretrained model, requires only about a minute of your voice. See "YourTTS: Zero-Shot Multi-Speaker Text Synthesis and Voice Conversion" on the Coqui blog.

I would be interested in a roadmap, and in testing it, more or less :stuck_out_tongue: . It would be a neat feature to use, and even, let’s say, to create my own TTS voice from an audio source. I love your Docker images, btw. I used them a while back to create shorter audiobooks :smiley:

@sparkyvision I’ve had success with as little as 30 minutes of voice recordings, but only from a specially chosen set of phonetically balanced sentences.

Some of the new features (relative to Mimic 2) include:

  • Entirely local; runs at a real-time factor of ~0.5 on a Pi 4 with a 64-bit OS
  • Can be accelerated by a GPU
  • Support for a subset of SSML (with custom pronunciations)
  • Can contain hundreds of voices in a single model
  • Currently supports English (U.S.), but I plan to train up all of the Larynx voices
  • Runs on Linux and Windows
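
For anyone unfamiliar with the term: real-time factor is usually defined as synthesis time divided by the duration of the audio produced, so ~0.5 would mean roughly half a second of compute per second of speech. A minimal Python sketch of that arithmetic is below. The function names are made up for illustration and are not part of Mimic 3, and actual timings will depend on the voice and the hardware.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.5) -> float:
    """Estimate compute time for a clip, assuming a constant real-time factor.

    The default rtf of 0.5 is the approximate Pi 4 figure quoted above;
    treat it as a rough assumption, not a measured guarantee.
    """
    return audio_seconds * rtf


if __name__ == "__main__":
    # Hypothetical example: a 10-second sentence at an assumed RTF of 0.5.
    print(estimated_synthesis_seconds(10.0))
    # Hypothetical measurement: 2 s of compute for 4 s of audio.
    print(real_time_factor(2.0, 4.0))
```

Anything below 1.0 means the speech is generated faster than it takes to play back.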

@Dominik For each language, I train an initial model from the largest dataset I have, and then use that to fine-tune other voices. I still need more data than YourTTS, of course.

@takov751 It’s getting close to something I would feel comfortable releasing. The training process is still a bit complex, unfortunately. I’m hoping to wrap it all up in a Docker image along with some pre-trained models so others can fine-tune their own voices :slight_smile:

And, of course, I’d like to release Mimic 3 Docker images and Debian installers to make it easier for folks to try it out!
