Mimic2 custom training produces "high horizontal line" in alignment graph

Thorsten · April 9, 2020, 4:52am

I’m currently training my own German voice based on mimic2, to be free to use by the community.
But the alignment graph always shows a high horizontal line, and the resulting speech sounds too robotic.

Will this line go away with more training steps (currently at 110k), or how can I fix this?

[Image: alignment graph at step 110000]

The generated tts wav file is available here:
https://drive.google.com/open?id=1A6cMpOxRtopXRwLpFeBCr31cBUxWr_gF

The alignment example in the MycroftAI/mimic2 GitHub repository (the text-to-speech engine based on the Tacotron architecture, initially implemented by Keith Ito) does not have this problem.

I would be thankful for any suggestions on how to fix this.

Thorsten

baconator · April 9, 2020, 6:40pm

I was never able to get this to go away with different datasets, including LJ.

Thorsten · April 9, 2020, 8:24pm

Thanks for your reply.

Do you know if the line is the reason for the voice sounding robotic with little reverb?
How many steps did you train during your tests? I’m currently on step 127k.

Thorsten · April 9, 2020, 8:38pm

When I run eval.py, crazy alignment graphs are generated for the (German) test sentences. But sentence no. 3 creates a “good” one.
[Images: eval alignment graphs at step 128000 for test sentences 4, 3, 2, and 1]
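
One way to put a number on "crazy" versus "good" alignment graphs is to check how long the attention stays stuck on a single input position. The following is only a minimal sketch, not part of mimic2 or eval.py: the function names are hypothetical, and it assumes the alignment is available as a nested list where `alignment[t][n]` is the attention weight of decoder step `t` on encoder step `n`.

```python
def argmax_per_decoder_step(alignment):
    """For each decoder step, return the encoder index with the highest weight.

    Assumption: alignment[t][n] is the weight of decoder step t on encoder step n.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in alignment]


def longest_stuck_fraction(alignment):
    """Fraction of decoder steps covered by the longest run on one encoder index.

    A value near 1.0 means the attention peak barely moves, which shows up
    as a horizontal line in the plot.
    """
    peaks = argmax_per_decoder_step(alignment)
    if not peaks:
        return 0.0
    longest = current = 1
    for prev, cur in zip(peaks, peaks[1:]):
        # Extend the run while the peak stays on the same encoder index.
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest / len(peaks)


# Toy examples (made-up data, not real training output).
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
stuck = [[0.0, 0.0, 0.0, 1.0] for _ in range(4)]

print(longest_stuck_fraction(diagonal))  # 0.25
print(longest_stuck_fraction(stuck))     # 1.0
```

In real speech several output frames map to each input character, so short runs are normal. The value is mainly useful for comparing checkpoints or test sentences against each other, not as an absolute threshold.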

baconator · April 9, 2020, 10:34pm

Up to 400k at one point. It never got better. Basically 50k is the cutoff: if alignment hasn’t happened by then, I’d kill it and try something else. Also, the large spikes were places where I’d stop, back up, and start again. That seemed to help a bit.
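
The "stop at a large spike and start again" routine could be roughly automated as below. This is only a sketch under assumptions: it reads "back up" as returning to an earlier checkpoint, the loss values and checkpoint steps are made up, and the helper names, the window size, and the spike factor are hypothetical choices rather than anything mimic2 provides.

```python
def find_spikes(losses, window=5, factor=2.0):
    """Return indices where the loss exceeds `factor` times the mean
    of the preceding `window` values."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes


def checkpoint_before(spike_step, checkpoint_steps):
    """Return the latest checkpoint step strictly before the spike, or None."""
    earlier = [s for s in checkpoint_steps if s < spike_step]
    return max(earlier) if earlier else None


# Made-up loss curve, logged every 1000 steps (index 0 = step 0, and so on).
losses = [1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 1.0]
log_interval = 1000
checkpoints = [0, 2000, 4000, 6000]  # hypothetical checkpoint steps

for index in find_spikes(losses):
    spike_step = index * log_interval
    print(spike_step, checkpoint_before(spike_step, checkpoints))  # 5000 4000
```

Comparing against a short moving average keeps the check independent of the absolute loss level, which drops a lot over a long run.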

Thorsten · April 15, 2020, 3:41pm

Thanks for your reply.

400k training steps sounds like a lot of invested compute time.

I’m currently running a training with the original Tacotron 1 implementation (keithito/tacotron on GitHub, an unofficial TensorFlow implementation of Google's Tacotron speech synthesis) on TensorFlow 1.6, instead of the mimic2 implementation on TensorFlow 1.8, using the same LJSpeech dataset.
The results and alignments seem (much) better.

It could be interesting to check whether the different results are caused by a code change in mimic2 or by the use of TensorFlow 1.6 instead of 1.8.

Here are some sample outputs from mimic2 vs. tacotron1 training.

Sample WAVs

Mimic2 alignment graphs

alignment 35k and 50k steps

[Images: mimic2 alignment graphs at step 50000 and step 35000]

Tacotron version 1 (keithito) graphs

alignment 35k and 50k steps

[Images: keithito Tacotron alignment graphs at step 50000 and step 35000]

TensorBoard (at 50k steps)

baconator · April 15, 2020, 6:59pm

This replicates what I experienced as well.

Thorsten · April 15, 2020, 7:01pm

So you tried Tacotron 1 too, instead of the mimic2 fork?

baconator · April 15, 2020, 7:20pm

Yes. This was how I built the first model I used for my local TTS, and it sounded pretty good.

Thorsten · April 15, 2020, 8:23pm

Was this your 400k training, or how many steps did you train your model for until it was good enough for everyday use?

baconator · April 15, 2020, 8:49pm

I did 400k a couple of times on each. The model I ended up using was about 350k on Tacotron (1). None of the current mimic2 models ever worked well for me; there’s a previous commit, from before the attention stuff was added, that worked closer to Tacotron.

I’ve been waffling about trying a Mozilla TTS Tacotron 2 modeling run with another dataset, but haven’t found the time to get it set up quite yet.