Introducing Mimic 3

@synesthesiam said previously in a Rhasspy forum thread he tried that one before and the quality isn’t good…

It was, but people told me that the voice I trained wasn’t understandable. I used this dataset: https://github.com/Edresson/TTS-Portuguese-Corpus

Do you know of any other TTS Portuguese datasets?

edit:
Listening to the recordings from the TTS experiments on the TTS-Portuguese Corpus page, in my opinion the Portuguese TTS audio results are perfectly fine and understandable, at least for experiments #1 and #3 on that page. Experiment #2 was also understandable but had added noise distortion.
For example, this wav file of the longest phrase from Experiment #3 is very good and highly understandable:
Hoje é fundamental encontrar a razão da existência humana
So @synesthesiam, I wonder why your results were not understandable when you used this dataset? The results above are pretty much the same quality as you get when using the Google Translate page to generate Portuguese TTS audio. It’s good!

  • Experiment 1 uses the DCTTS model, trained on the TTS-Portuguese Corpus, and the RTISI-LA vocoder (Good)
  • Experiment 2 uses the Tacotron 1 model, trained on the TTS-Portuguese Corpus (Bad)
  • Experiment 3 uses the Mozilla TTS model, trained on the TTS-Portuguese Corpus (Very Good)

Interesting results. Edresson is a knowledgeable person when it comes to TTS, e.g. see his work on YourTTS.

@synesthesiam Would it make sense for me to apply my audio preprocessing chain (the one I used for Thorsten-DE) to the TTS-Portuguese-Corpus?

Looks like he’s contributing his TTS knowledge to the Coqui.ai project.

I wonder if I need to just train a model directly on characters rather than trying to use a phonemizer. My most recent attempt in Mimic 3 used the pt-br voice from espeak-ng.

Looking at Edresson’s model config, his audio settings are a bit different from mine. For example, his sample rate is 20000 instead of 22050, and he’s using “preemphasis” which appears to filter the audio before training. So maybe the problem is my naive use of the data directly without enough preprocessing?
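To make the two differences concrete, here is a minimal pure-Python sketch of a pre-emphasis filter and a naive resampler. This is not taken from Edresson's or Mimic 3's code: the coefficient 0.97 is only a commonly used default (an assumption, check the actual config), and the function names are made up for illustration. A real pipeline would use a proper resampler with anti-aliasing.

```python
def preemphasis(samples, coef=0.97):
    """Apply y[n] = x[n] - coef * x[n-1]; the first sample passes through.

    coef=0.97 is an assumed, commonly used value, not one read from any config.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coef * prev)
    return out


def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of audio at 22050 Hz becomes 20000 samples at 20000 Hz.
one_second = [0.0] * 22050
converted = preemphasis(resample_linear(one_second, 22050, 20000))
```

Pre-emphasis boosts high frequencies relative to low ones, so a model trained on pre-emphasized audio needs the matching de-emphasis step at synthesis time.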

Hi, sorry for bothering you. Do you have the plan for releasing the training source code yet? Thank you, looking forward to your information.

I’ve published a video on my YouTube channel showing all the ways to install and run Mimic 3, plus first steps for synthesizing audio via the CLI or the local web UI :slight_smile:.

Just in case it’s interesting for you.

Hi @rostom132, no bother :slight_smile:
I do have a plan, but no definite release date yet. The training software as it is right now is the result of a year and a half of experimentation, including a lot of dead ends. I’m cleaning it up now and removing a lot of the unused code. My hope is to make it work closely with Mimic Studio.

Thanks for the awesome video, @Thorsten! I was very happy to see all the installation methods worked out well :slight_smile:

For anyone curious about the SSML volume, it currently just goes from 0-100%. Something I definitely need to fix :+1:

You’re welcome @synesthesiam :slight_smile:

In general a volume value between 0 and 100 makes sense, but just in case you want your TTS voice “yelling” at you a higher value could make sense too :wink:.
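As a rough illustration of the two behaviours discussed here (a hard 0-100 range versus allowing a boost above 100), this is a minimal sketch of percentage-based gain on float samples. It is not how Mimic 3 implements volume; `apply_volume` and `max_percent` are hypothetical names used only for this example.

```python
def apply_volume(samples, volume_percent, max_percent=100.0):
    """Scale float samples in [-1.0, 1.0] by a percentage.

    The percentage is clamped to [0, max_percent]; scaled samples are
    clipped back into [-1.0, 1.0] so a boost cannot overflow the range.
    """
    percent = min(max(volume_percent, 0.0), max_percent)
    gain = percent / 100.0
    return [min(max(s * gain, -1.0), 1.0) for s in samples]


quiet = apply_volume([0.5, -0.5], 50)                      # half volume
capped = apply_volume([0.5], 150)                          # clamped to 100%
yelling = apply_volume([0.5, 0.8], 200, max_percent=200)   # boost allowed, then clipped
```

Raising `max_percent` allows "yelling", but any sample pushed past full scale is clipped, which is audible as distortion.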

Hello,
My final goal is to record enough audio for a Slovak voice and later use the espeak-ng phonemizer with those recordings to train the voice.
Slovak is somewhat similar to Czech, and since some Czech recordings are available, perhaps I can first learn the whole process with the Czech data and then move on to making Slovak recordings.
There are some Czech voices for Festival with available recordings.
For example here is the list of words: voice-czech-ph/words at master · brailcom/voice-czech-ph · GitHub
And here are the actual recordings: voice-czech-machac/wav at master · brailcom/voice-czech-machac · GitHub
These are so-called diphone voices for Festival, so the recordings consist of a list of words where each word features a specific preselected syllable used for training.
Would it be doable, and does it make sense, to use these recorded words to train a Czech voice for Mimic 3? Is it likely to give better results than Festival?

If I manage to record or otherwise source a reasonable number of recordings for a Slovak voice, let’s say a few hours, will I be able to train the voice on my own laptop or desktop computer, or does it need much more power?
I see how the LJ Speech and Thorsten German recordings are structured. Should I move on to recording, or am I supposed to know how it all works before recording? In other words, does my experiment with the existing Czech recordings make sense?

Greetings

Peter

Hi @pvagner, welcome :slight_smile:

The Czech recordings are a good start, but you would also need full sentence recordings. Single words are important to include too, of course – you can hear Mimic 3 struggle with them for many voices because the whole dataset was just full sentences.

It might, but Mimic 3 needs full sentences to get the right inflection and pacing of a sentence. I don’t know for sure, but I’d guess that single words only would result in an unnatural sounding voice.

For training, I’ve struggled to train voices on a GTX 1060 6GB. I’d recommend something with at least 8-10+ GB of VRAM. If you’re willing to provide the dataset with an open license, I’d be happy to train the voice for you on the GPUs I have here at home :slight_smile:

I’ve been working on a tool that might save you some time. It takes text from existing corpora like the Oscar corpus and tries to create a small phonetically balanced list of sentences to read. It looks like Oscar has Slovak; would you be willing to help me create a dataset? I need a native speaker to review sentences and ensure they make sense and (because they come from the internet) are not advertisements for adult material :see_no_evil:
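For readers wondering how a sentence list like that could be built, one common approach is greedy coverage selection: repeatedly pick the sentence that adds the most sound units not yet covered. The sketch below is only a guess at the general idea, not the actual tool. It uses letter pairs as a crude stand-in for phonemes (a real version would run a phonemizer such as espeak-ng first), and `units` and `select_sentences` are hypothetical names.

```python
def units(sentence):
    """Stand-in for phonemes: the set of lowercase letter pairs inside each word."""
    pairs = set()
    for word in sentence.lower().split():
        letters = [c for c in word if c.isalpha()]
        pairs.update(a + b for a, b in zip(letters, letters[1:]))
    return pairs


def select_sentences(candidates, max_sentences):
    """Greedily pick the sentence that adds the most not-yet-covered units."""
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen


corpus = ["ab", "abc", "cd", "ab"]
picked = select_sentences(corpus, max_sentences=3)
```

Greedy coverage only guarantees that each unit appears at least once when possible; balancing how often each unit appears would need an additional frequency target.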

Just in case you are using Home Assistant and would like to use it with Mimic 3.

Hello again,
Excuse me for the late reply.
Yes, I am definitely interested. I don’t have access to such powerful machines, so I accept that I won’t be able to build it myself. Still, I’d be happy to move forward with this, so I’d like to help create the text we should read. That way we can ensure the licensing is correct.

Hello.
I am planning to use Mimic 3 as the TTS engine in my project, which will use Qt Speech (with the help of speech-dispatcher) on an RPi4/CM4. To build the OS image I use boot2qt (Yocto-based). I see mimic1 in Yocto. Are there any plans to support Mimic 3? (I’m a newbie with Yocto and can’t do it myself yet :slight_smile: )