We (some nice folks from the Mycroft community and me) are currently working on a free-to-use German TTS voice based on my personal voice dataset contribution.
The model is based on Tacotron 2 combined with a PWGAN (Parallel WaveGAN) vocoder, and it can be run locally without any cloud connection. Through plenty of trial and error we're working to provide a free model with acceptable quality for daily usage, but we still have some work to do.
Nevertheless, we wanted to share some sample audio as a "sneak preview" of what is currently possible.
Currently you would need a GPU to produce speech in real time. On a typical CPU it takes around 2 seconds to produce 2 seconds of audio, and maybe 8-10 seconds on a regular Pi. So it's still not where we want it, but until now we didn't have a free model at all. A small step for us and a small step for Mycroft. @Dominik can give more info, because he has already tried it.
I am running my tests on a Xavier AGX. (A direct comparison with a desktop graphics card is difficult, but a GTX 10x0 with 8 GB should give similar or even better results.)
In the best case I see a real-time factor of 0.3 (1 second of audio requires 0.3 seconds of processing). Because the model still has some problems with "stop attention", this can go up to 5.0. Interestingly, this happens with shorter phrases.
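For anyone wanting to reproduce these numbers: the real-time factor is just processing time divided by the duration of the generated audio. A minimal sketch of measuring it (the `synthesize` function and the 22050 Hz sample rate are assumptions, substitute whatever TTS call and rate your setup uses):

```python
import time

def real_time_factor(synthesize, text, sample_rate=22050):
    """Measure RTF = processing time / duration of the generated audio.

    `synthesize` is a placeholder for your actual TTS call; it should
    return a sequence of audio samples at `sample_rate`.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds

# Dummy synthesizer for demonstration only: returns 1 second of
# silence after a tiny artificial "processing" delay.
def fake_tts(text):
    time.sleep(0.01)
    return [0.0] * 22050

rtf = real_time_factor(fake_tts, "Hallo Welt")
```

An RTF below 1.0 means faster than real time (0.3 means 1 second of audio is ready in 0.3 seconds); above 1.0, the listener ends up waiting.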
But with some tricks, like caching the synthesized audio files, you'll get a better experience.
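The caching idea can be sketched in a few lines: key the cache on a hash of the text, and only run the slow synthesis on a miss. The `synthesize` callable, the cache directory name, and the WAV-bytes return type are all assumptions for illustration, not part of any project's actual API:

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")  # assumed location, adjust to taste

def cached_synthesize(text, synthesize):
    """Return audio bytes for `text`, synthesizing only on a cache miss.

    `synthesize` stands in for the slow Tacotron2+vocoder call and is
    expected to return the rendered audio as bytes (e.g. a WAV file).
    """
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.wav"
    if path.exists():
        return path.read_bytes()  # cache hit: just a disk read
    audio = synthesize(text)      # cache miss: pay the synthesis cost once
    path.write_bytes(audio)
    return audio
```

Frequently repeated phrases (time announcements, confirmations) then cost only a disk read after the first utterance, which matters a lot on a Pi-class device.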
The main takeaway for me was that the data can be used to produce a reasonably good model. In the beginning it didn't work and we didn't know why. Now we know that we can use Thorsten's data and can try different configs or combinations. @Dominik, thanks for the numbers.
I'm a little concerned that a top-of-the-class board with 32 TOPS peaks at an RTF of 5, but that may just be a configuration problem.
I wonder how the Coral Dev Board, or its broken-out coprocessor (the USB Accelerator), would perform. I don't like the idea of letting my Windows PC do the heavy lifting, since that would mean keeping the PC powered 24/7.
For Tacotron, a GPU would be ideal. I use NVIDIA 1030s; they don't draw much when idle, and fanless models are available. Yes, this means running a host with them in it 24/7, but for quality and speed you're going to have to make some trade-offs.
We're quickly approaching the point where a CPU can be used instead of a GPU, so this answer may change within the next year.
@baconator
Oh OK, now that I've dug a little deeper I see that the article talks about 2 servers, with the second model already packaged, so I hadn't recognized it as such.
So, STT aside: is the TTS serving setup (described in the how-to) still viable? Or what would you suggest?
Is STT modeling sourced from one speaker beneficial?