Questions on Creating a New Voice

I got Mimic III up and running on a local machine. Everything is working and it’s pretty cool. Thanks to everyone who has contributed.

I’d like to have a couple more American-sounding (to my ear) voices available. I’d be happy to help create a voice, but I am pretty much at a loss as to how to start. Is the best way just to pull a copy of the Mycroft recording studio and start grinding away?

Once I do get some recordings done, is there a tutorial/repo for training Mimic III voices yet? Or, if I’m willing to donate the recordings to the public domain, is there anyone willing to do that training for me?

(Related) Is there a way to fine-tune an existing voice, or should I aim for a full 40-hour dataset?

I know that’s a lot of questions, but I’m just really excited about the whole project. :smiley:

I don’t have all of the answers, but I know a little about voice training from our Coqui TTS plugin (there’s a demo on Huggingface).

For recording a dataset, you will want to record short segments (about a sentence each) in a neutral tone and make sure to take breaks so you don’t strain your voice. I haven’t looked at the Mycroft recording studio personally, but if it associates audio files with text then that should be all you need for a good dataset.
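If it helps, here is a rough sketch of how you could sanity-check that every recording has a matching line of text before training. The layout is purely my assumption: a folder of `.wav` files plus a metadata file with one `id|text` row per clip, where the id is the file name without the extension. I haven't checked what the recording studio actually exports, so treat `find_mismatches` and the `|` delimiter as hypothetical and adjust to whatever you really have.

```python
import csv
from pathlib import Path


def find_mismatches(metadata_path, wav_dir, delimiter="|"):
    """Compare transcript ids against the .wav files in wav_dir.

    Assumes (hypothetically) one row per clip with the clip id in the
    first column. Returns (ids_without_audio, audio_without_text).
    """
    with open(metadata_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        ids = {row[0].strip() for row in reader if row}

    # File names without the extension, e.g. "clip_0001.wav" -> "clip_0001"
    wavs = {p.stem for p in Path(wav_dir).glob("*.wav")}

    return sorted(ids - wavs), sorted(wavs - ids)
```

Both lists coming back empty means every transcript line has a clip and every clip has a transcript line.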

I always encourage contributing under some permissive license (Creative Commons licenses are popular for datasets). It removes a lot of questions about licensing any output models if you’re explicit about the data license.

I know transfer learning exists for Coqui, so you can train voices with significantly less than 40 hours if you already have a good model (as there is for English). Training a voice in either case usually takes on the order of days with a good GPU.


To follow up on @djmcknight358’s comment, Mimic 3 voices can be fine-tuned, so you only need 2-5 hours of data to get a pretty decent voice. I’ve had success with as little as 30 minutes using the phonetically balanced Harvard Sentences.
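To keep track of how close you are to those numbers, something like this minimal sketch can total up what you have recorded so far. It assumes your clips are uncompressed PCM `.wav` files somewhere under one folder; the `total_hours` name is just mine, and it only uses Python's standard `wave` module.

```python
import sys
import wave
from pathlib import Path


def total_hours(wav_dir):
    """Sum the duration of every .wav file under wav_dir, in hours."""
    seconds = 0.0
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0


if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{total_hours(folder):.2f} hours recorded")
```

Run it with the recordings folder as the only argument and compare the result against the 30 minute or 2-5 hour targets above.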


Thanks for the balanced sentences. I’ll use them first as the corpus when I start recording. For anybody else who wants to use the Harvard Sentences as the corpus with the Mimic Recording Studio, I’ve formatted them into the necessary CSV here: Harvard Sentences Corpus

