Free german tts voice for mycroft (sneak preview)

Thorsten · September 18, 2020, 7:57pm

Hello.

We (some nice guys from mycroft community and me) are currently working on a free to use german tts voice based on my personal voice dataset contribution.

The model is based on a tacotron 2 combined with a pwgan vocoder. This model can be run locally and without cloud connection. We’re trying hard (try’n-error) to provide a free model with an acceptable quality in future for a daily usage, but we still have some work to do.

Nevertheless we wanted to show you some sample audio as „sneak preview“ what is currently possible.

Also available on soundcloud

Thanks @Dominik, @baconator, Repodiac, @Olaf

For more information on the dataset feel free to look at my github page:

SGee · September 18, 2020, 8:22pm

Followed your efforts over the last weeks, i’m curious what hardware (CPU/GPU) should be used to facilitate a flawless experience.

Olaf · September 18, 2020, 8:45pm

Currently you would need a GPU to produce speech in real time. So it would take 2 seconds to produce 2 seconds of audio, maybe 8-10 on a regular pi. So still not what we want, but up until now we didn’t have a free model at all. So a small step for us and a small step for mycroft @Dominik can give more info, because he has already tried it.

Dominik · September 19, 2020, 6:05am

I am running my tests on a Xavier AGX. (A direct comparison with graphics card is difficult but a GTX 10x0 with 8GB should give similar or even better results).

In best case I see a real-time factor of 0.3 (1 sec audio requires 0.3 seconds of processing). As the model still has some problems with “stop attention” this goes up to 5.0. Interestingly this happes with shorter phrases.

But with some tricks like caching of the synthesized audio files you will get a better experience.

Dominik · September 19, 2020, 8:04am

Still uncharted terrority for me: you can convert the PyTorch-model to TensorflowLite. This may result in better performance on a RPI…

j1nx · September 19, 2020, 9:03am

Just to say; Cool guys! Keep it up.

Olaf · September 19, 2020, 9:59am

The main take away for me was that the data can be used to produce a reasonably good model. In the beginning it didn’t work and we didn’t know why. Now we know that we can use Thorsten’s data and can try different configs or combinations. @Dominik thanks for the numbers.

SGee · September 19, 2020, 6:08pm

Little concerned that a top of the class board with 32 TOPS peaks at 5 RTF, but that’s maybe a configurational problem.

I wonder how the coral dev board resp. the broken out coprocessor (USB accelerator) would perform. Don’t like the idea to let my WinPC do the heavy lifting, since this would deem the PC to be powered 24/7.

Is this benchmarked using Mycroft?

Thorsten · September 19, 2020, 6:17pm

In addition to what @Olaf already said you might wanna take a look on this thread.

Dominik · September 19, 2020, 6:36pm

The current model needs some fine-tuning for shorter phrases (up to 6-7 words), longer sentences work better already.

Xavier-AGX has a very good “TOPS per Watt” value. Even in “max power mode” it idles around 5W and peaks at 30-40W.

baconator · September 20, 2020, 8:04am

For tacotron, a gpu would be ideal. I use nvidia 1030’s, they don’t draw much when idle and fanless models are available. Yes, this necessitates running a host with them in it 24/7, but for quality and speed you’re going to have to make some trade-offs.

We’re quickly approaching a place were cpu can be used instead of a gpu, so this answer may change in the next year.

SGee · September 20, 2020, 9:50am

Very recently an article popped up on how to setup a Win (easily reproduceable in linux) deepspeech server for Mycroft

Yet, Mozilla trained models seem a little bit different with using pbmm and some separate scorer.

Like the exclusion

Got a stripped naked 1080 (only 2GB dedicated though)

Thorsten · September 21, 2020, 6:52am

I already thought that’d catch someone’s eye .
I’m trying to be “nice” though, but if i’m successful in it should “the other guys” say.

Dominik · September 21, 2020, 8:07am

I can gladly confirm that Thorsten is a nice guy, too.

SGee · September 21, 2020, 10:29am

As it turns out this is a mmap-able format for inferencing. The pretty easy process to convert pb to pbmm is described here

baconator · September 21, 2020, 3:29pm

This would be for deepspeech, not TTS.

SGee · September 21, 2020, 3:44pm

The Deepspeech server is serving STT/TTS. I just don’t think it will run another model type.

baconator · September 21, 2020, 3:58pm

Deepspeech just does STT. Tacotron is TTS.

This is about the later.

SGee · September 21, 2020, 9:31pm

@baconator
Oh OK, now i’ve dug a little deeper i saw that the articles talking about 2 servers with the second model already packaged, so i haven’t recognized it as such.

So, STT aside. Is the TTS serveing (described in the how-to) still viable? Or what would you suggest?

Is STT modeling sourced from one speaker beneficial?

@Thorsten @Dominik Have you planned to upload the model?

baconator · September 21, 2020, 9:48pm

If you’re referring to TTS/TTS/server at master · mozilla/TTS · GitHub then yes, this is still viable and I just sent off a package earlier today using that.

Possibly for that one person. A wider set of submitted data would almost certainly help, even if the bulk of the data was from one person.