
Mimic II pre-trained model


Does Mycroft share any of the pre-trained models for Mimic II? I’ve found the models built for the original Mimic here, but I can’t find anything for Mimic II.


The model itself isn’t necessarily open source. You can access it once you configure Mimic2 on your instance of Mycroft, of course. There’s probably some value in having the LJSpeech corpus modeled and available at some point as well.


I’ve looked at the publicly available datasets. I was hoping to avoid training a model from scratch, because training on a CPU will take weeks to get decent results, and running on a GPU is pretty expensive.


If you wait a couple weeks, I might be able to give the LJSpeech set a try.


That’s right - at the moment the pre-trained Mimic 2 voice, the Kusal voice, is not available, as it’s a premium offering for our Subscribers ($2 a month, good value!)


Hey, I was going to make a new thread but decided I might as well ask here.
How does making a voice model work with Mimic II? Can it be any recording with a transcription, or does it have to be a certain set of phrases? Is it possible to set up a way to volunteer your voice? Or perhaps use long audio recordings like Librivox? I don’t have the best voice but I would love to attempt to create a voice (mostly to study how it works, I have to have a goal). I love Mycroft but I really hate the Alan Pope voice. Of course, I understand why premium voices are necessary, but I would love more voices.


It can be any set of phrases, but more is almost always better. The Mimic II repo doesn’t provide any tools for collecting text and voice data; you’ll have to do that on your own, and also write your own pre-processor.
Details on how to do that are in the repo.
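For illustration, here’s a minimal sketch of the loading half of such a pre-processor, assuming a simplified pipe-separated `id|transcription` format with the audio in a `wavs/` folder (the real LJSpeech metadata file also carries a third, normalized-text column - this is not Mimic2’s actual pre-processor, just the general shape of one):

```python
import os


def load_metadata(path):
    """Read simplified LJSpeech-style metadata: each line is 'clip_id|transcription'.

    Returns a list of (wav_path, transcription) pairs, skipping blank or
    malformed lines. A real pre-processor would go on to compute spectrograms
    for each wav and serialize them for the trainer.
    """
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            clip_id, _, text = line.strip().partition("|")
            if clip_id and text:
                entries.append((os.path.join("wavs", clip_id + ".wav"), text))
    return entries
```

Following an existing layout like this means you can crib most of the remaining preprocessing code from the scripts already in the repo.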


Ok, so I’m also trying to see if I’m up to the challenge of making a voice. Assuming I have all the necessary audio and the pre-processor, what do I do then? How long will it take on an average laptop? And this is probably a question for a team member, but is it legal to use a Librivox recording?


At M-AILABS you can find audio material for training voices (some of it based on Librivox), ready to run with Tacotron/Mimic and "free to use". Still, it might be a nice move to contact the speaker and notify them that a voice assistant is speaking with their voice.

But beware: with a normal desktop CPU it will take weeks or months to get usable results. A GPU (GTX 1080 or better) is highly recommended.


All Librivox recordings are in the public domain, so you should be fine. I’m using a particular one for my testing, in fact. LJSpeech and M-AILABS both use Librivox as a data source.

The Google Tacotron voices were built with 20-44 hours of high-quality, highly regulated recordings from a professional voice artist. I believe the Kusal voice is 16 hours of high-quality recordings from a well-trained speaker. LJSpeech is done from 128 kbps MP3 files converted to WAV. I have 13 hours (9.5k clips) so far for my dataset, and it’s being cranky about working well.

The transcriptions should be as accurate as possible. Any significant number of proper names or unusual pronunciations should probably be added to your local CMU-Dict file. Basically, review the LJSpeech dataset and compare your data with it. My set has about 11k distinct words, of which 900 were not in CMU-Dict. Of those 900, the majority are proper nouns and their possessive contractions, followed by “un”-prefixed words, “s”-, “ly”-, and “ies”-suffixed words, then misspellings or odd variations (UK vs. US spellings).

The vocal speed should be as uniform as possible, and the assortment of clips should be spread as evenly across the time range (1-10 seconds) as possible. The formatting of your data should probably follow one of the existing types, so you can more easily re-use the preprocessing scripts. Mimic2’s analyze function can serve you well for evaluating your dataset.
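You can get a rough feel for the two checks above - the clip-duration spread and the out-of-vocabulary word count - with a few lines of stdlib Python. This is just an illustrative sketch (Mimic2’s own analyze function does a more thorough job), and it assumes plain WAV clips plus a CMU-Dict-style set of known words:

```python
import wave
from collections import Counter


def clip_duration(path):
    """Duration of a WAV clip in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def duration_histogram(paths, bucket=1.0):
    """Count clips per duration bucket, to check the spread across 1-10 s."""
    return Counter(int(clip_duration(p) // bucket) for p in paths)


def oov_words(transcriptions, dictionary):
    """Words in the transcriptions missing from a CMU-Dict-style word set.

    CMU-Dict entries are uppercase, so words are uppercased and stripped of
    surrounding punctuation before the lookup.
    """
    words = Counter(
        w.strip(".,!?;:\"'").upper()
        for t in transcriptions
        for w in t.split()
    )
    return {w: n for w, n in words.items() if w and w not in dictionary}
```

Anything `oov_words` flags is a candidate for a local CMU-Dict addition; a lopsided `duration_histogram` suggests recording more clips at the under-represented lengths.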

You will not want to train on a laptop, unless you hate your laptop and never want to end up with anything usable. Get a GCP or AWS GPU instance and train there if nothing else. An NVIDIA 1070 or 1080 can be used for training; lower-end GPUs will run you into more and more issues as the hardware gets weaker. A single 1070 does about 30-50k steps per day, and you will want to train until you find the point of overtraining - for large datasets this is probably 300k+ steps. With a good dataset, you should see alignment by 25k steps. Based on your hardware, you will want to adjust the hparams in various directions: less data, lower training rate; more device memory, larger batch size / lower outputs per step. There’s a few dozen knobs and buttons to tweak along the way.

Also feel free to check the chat server’s machine learning channel as well.


So it seems this isn’t going to be my project, though keep us updated about your voice. Mostly I just wanted a better-quality voice for myself, however I could obtain that. Do you know of anyone who has made a better Mimic 2 voice?


The Kusal voice is quite good (see Kathy’s comment above). Other than that, no, I don’t know of anyone who’s made public a voice model for it yet.
(eta) Looking at the LJSpeech data, I will try and model that to a reasonable length this weekend, and see what comes out.