Poor quality on short phrases, good quality on long phrases

sjtilney · December 3, 2018, 10:15pm

I’m training on the LJSpeech dataset using a GPU, and it has reached about 310,000 steps. I’ve noticed that when I test on short phrases, the output sounds so robotic that I can’t understand what it’s saying.
When I test on longer phrases, it sounds great, good enough that I might be able to use it in a production environment.
I haven’t adjusted any of the hparams other than changing batch size from 32 to 64. I’ve posted the sample output below. Any ideas as to why I’m getting these results on the shorter phrases?

Short phrase
“Hello, I am your virtual assistant”
Sample: https://drive.google.com/open?id=1OXkFrPI8o3UmGBXnmHA0MYkOiZImObcB

Long phrase
“Hello, I am your virtual assistant. How can I help you today? You can ask me anything.”
Sample: https://drive.google.com/open?id=1NC6_Ah6sd3L5fwQrfOO1OaCxFFxoi52A

baconator · December 3, 2018, 11:44pm

Sounds like bad alignment. If you run eval.py against the model, does that result in good audio? What do your step-???-align.png graphs look like? Do they reach a near-solid diagonal line? What about really short phrases, like “hello”?
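If eyeballing the plots is inconclusive, a rough numeric check can help. The following is only a minimal sketch, not something taken from the training code: it assumes you can get the attention alignment as a 2-D list indexed as `alignment[encoder_step][decoder_step]` (call `.tolist()` on a NumPy array first), and the function name `alignment_stats` and its three measures are hypothetical choices made for illustration.

```python
def alignment_stats(alignment):
    """Summarize an attention alignment.

    Assumed layout (not confirmed by the thread):
    alignment[encoder_step][decoder_step] holds one attention weight.
    """
    n_enc = len(alignment)
    n_dec = len(alignment[0])

    peaks = []         # most-attended encoder position per decoder step
    peak_weights = []  # attention weight at that position
    for t in range(n_dec):
        column = [alignment[i][t] for i in range(n_enc)]
        best = max(range(n_enc), key=column.__getitem__)
        peaks.append(best)
        peak_weights.append(column[best])

    # Close to 1.0 when attention is a thin line, low when it is smeared out.
    sharpness = sum(peak_weights) / n_dec

    # Share of decoder steps where the peak does not move backwards.
    forward = sum(1 for a, b in zip(peaks, peaks[1:]) if b >= a)
    monotonic = forward / (n_dec - 1) if n_dec > 1 else 1.0

    # Share of input positions that are the peak at least once.
    coverage = len(set(peaks)) / n_enc

    return {"sharpness": sharpness, "monotonic": monotonic, "coverage": coverage}
```

A clean diagonal should score close to 1.0 on all three measures. A smeared or stuck alignment tends to show low sharpness or low coverage, though where to draw the line is a judgment call.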

sjtilney · December 5, 2018, 5:07am

This is what the graph looks like after 355,000 steps: https://drive.google.com/file/d/1PH4bzb5OpIW4QSoD7EltEhTxpydpxCUf/view?usp=sharing

Short phrases, such as “hello”, sound terrible. The model generates 12 seconds of garbage audio.

Running eval.py gives mixed results. Some of the output sounds perfect, some sounds like a robot, and some is completely unintelligible.

What are some things I can try to get better alignment?

baconator · December 5, 2018, 5:39am

Do you still have saved models from around 100k or 200k steps that you can evaluate against, to see if they show the same behavior?
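To find which saved models sit closest to those step counts, something like the sketch below could work. It assumes TensorFlow-style checkpoint files named `model.ckpt-<step>.index`; the `model.ckpt` prefix is a guess about your setup, and `nearest_checkpoints` is a hypothetical helper, not part of any training script.

```python
import re

def nearest_checkpoints(filenames, targets):
    """Map each target step to the closest saved checkpoint step.

    Assumes files named model.ckpt-<step>.index (an assumption about
    the naming scheme; adjust the pattern to match your log directory).
    """
    steps = set()
    for name in filenames:
        match = re.fullmatch(r"model\.ckpt-(\d+)\.index", name)
        if match:
            steps.add(int(match.group(1)))
    if not steps:
        return {}
    # Ties go to the earlier checkpoint.
    return {t: min(steps, key=lambda s: (abs(s - t), s)) for t in targets}
```

For example, `nearest_checkpoints(os.listdir(log_dir), [100_000, 200_000])`, where `log_dir` stands in for wherever your checkpoints are written, returns a dict of the step numbers to evaluate.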

I’d probably stop and restart things if it wasn’t aligning by around 30-40k steps and sounding somewhat reasonable by 100k. The alignment graphs definitely shouldn’t be high triangles like that by 300k. What hparams are you using, and are there any other differences of note?