Mozilla's TTS comparison


#1

It seems like they are based on the same model as Mimic2, but use PyTorch and focus on being lightweight (with the hope that it could even run on a Raspberry Pi in the future).


#2

It will be interesting to see if it is better at longer sentences.

The Nancy dataset (application required) definitely appears to be a better set to use based on their results. Also curious whether they’re going to add LPCNet to that instead of WaveRNN.


#3

We talk to Eren and the Mozilla machine learning team every week or so, sharing ideas and some training data. We are looking at different aspects of the problem: the Mozilla team is more research-y, while we are more focused on producing usable tools and putting voices into production. Neither is “right” – both are valuable approaches, and I think we are benefiting each other quite well, as any good collaboration should!

The Nancy dataset is interesting, but the licensing terms are a little odd: they cannot really do anything with the trained models outside of the lab. We have shared our Kusal dataset and are building a new female voice from the exact same corpus. This is giving us great insight into what matters most when producing a high-quality voice.

Hint: Consistency appears to be more important than sheer volume of data.