So all the Mimic 3 models are listed as low quality? Does this indicate that medium, high, or even ultra quality models are a possibility today?
I’m curious about both the quality of these models and their real-time factor on different hardware. Obviously low quality on a Pi 4 is 0.5, so I assume we can’t go much higher there. But different use cases may allow different hardware.
Yes! There are quite a few hyperparameters that can be tuned on the VITS model. VITS is a combination of GlowTTS and HiFi-GAN, and Mimic 3’s current notion of “low” and “high” quality maps to the v3 and v1 HiFi-GAN configs, respectively.
However, there are other parameters that also influence quality and real-time factor. I’m investigating the effects of changing the following parameters right now:
- Audio sample rate (currently 22050 Hz, testing 16000 Hz)
- Number of hidden/inter/filter channels
- Whether or not the input has a “0” after every symbol (interspersed padding)
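To make the last item concrete, here is a rough sketch of what interspersed padding does to the input. The function name and the pad id of 0 are just for illustration, and this variant also puts a pad before the first symbol, which is an assumption on my part rather than something Mimic 3 is guaranteed to do.

```python
def intersperse(symbol_ids, pad_id=0):
    """Return symbol_ids with pad_id placed around every symbol.

    Illustrative only: pad_id=0 and the leading pad are assumptions.
    """
    # Start with all pads, then drop the real symbols into the odd slots.
    result = [pad_id] * (2 * len(symbol_ids) + 1)
    result[1::2] = symbol_ids
    return result


phoneme_ids = [5, 9, 2]               # hypothetical symbol ids
padded = intersperse(phoneme_ids)     # [0, 5, 0, 9, 0, 2, 0]
```

The padded sequence is roughly twice as long as the original, so the text side of the model has about twice as many positions to process. That is presumably why dropping the padding can help the real-time factor.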
With a reduced sample rate, fewer channels, and no padding, I can get a real-time factor of about 0.3 on a Raspberry Pi 4, so there is definitely room for improvement!
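For anyone who wants to compare hardware themselves, real-time factor here is synthesis time divided by the duration of the audio produced, so lower is better and anything under 1.0 is faster than real time. A minimal sketch of measuring it follows. The `synthesize` argument is a hypothetical stand-in for whatever function returns your audio samples, not an actual Mimic 3 API.

```python
import time


def real_time_factor(synth_seconds, num_samples, sample_rate):
    """Synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synth_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate=22050):
    """Time one call to a hypothetical synthesize(text) -> sequence of samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


# Example: 1 second of compute for 2 seconds of 22050 Hz audio gives 0.5.
example = real_time_factor(1.0, 44100, 22050)
```

Note that the sample rate appears in the denominator through the audio duration, so compare runs on the same text and average over several sentences to get a stable number.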
Going the other direction, a “high” quality model can likely be improved by increasing the number of channels. If you have input audio with a higher sample rate, it should be possible to train a model at that rate (e.g., 44.1 kHz). I haven’t tested anything in this range though, since I’m focused on what will run well on the Mark II.