Google used over 24 hours of audio for their Assistant voice, and they're the ones who wrote the Tacotron papers.
Most Tacotron-based models worth listening to (Tacotron 1/2, Mimic2, Coqui w/o vocoders) use well over ten hours of data. I'd consider that the minimum for a high-quality end result. The quality of that set is just as important: any hiss, hum, or background noise will be magnified by training. Turn up the volume on the LJ Speech samples and you can hear a bit of it. Thorsten's first chunk of data required a lot of cleanup work, and he updated his recording setup afterwards to improve it.
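To get a rough sense of how clean a take is before committing hours to recording, you can estimate the noise floor from the quietest stretches of the audio. Here's a minimal sketch with NumPy; the `noise_floor_dbfs` helper and the synthetic "take" are my own illustration, not part of any TTS toolkit:

```python
import numpy as np

def noise_floor_dbfs(samples: np.ndarray, rate: int, win_ms: int = 50) -> float:
    """Estimate the noise floor (in dBFS, full scale = 1.0) as the RMS level
    of the quietest 10% of short windows - those fall in the pauses, where
    only mic hiss / room noise remains."""
    win = max(1, int(rate * win_ms / 1000))
    n = len(samples) // win
    rms = np.sqrt(np.mean(samples[: n * win].reshape(n, win) ** 2, axis=1))
    quiet = np.sort(rms)[: max(1, n // 10)]      # quietest 10% of windows
    return 20 * np.log10(np.mean(quiet) + 1e-12)

# Synthetic two-second "take": a pause, then a tone standing in for speech,
# with gaussian hiss (~-50 dBFS) over the whole thing.
rng = np.random.default_rng(0)
rate = 16000
t = np.arange(rate) / rate
speech = 0.3 * np.sin(2 * np.pi * 220 * t)
silence = np.zeros(rate)
hiss = 0.003 * rng.standard_normal(2 * rate)
take = np.concatenate([silence, speech]) + hiss

print(f"estimated noise floor: {noise_floor_dbfs(take, rate):.1f} dBFS")
```

A rule of thumb I've seen quoted for voice work is to keep the noise floor somewhere below roughly -60 dBFS; opinions vary, but a take that measures much louder than that is probably worth re-recording rather than trying to clean up afterwards.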
You can certainly try with less and see what you get. Other architectures might be able to do better with less data (FastSpeech 2?).
Those are some things I didn’t know. I actually switched to the Google Voice from Mimic2 because I didn’t like the voice, ha. Well then, I’ll have to find someone willing to donate their voice. I certainly don’t want my assistant to sound like me.
@baconator, do you know if any of those various "need a smaller sample set to make a good model" methods are actually usable? Or do they currently exist only in research papers and on university servers? I've done a good deal of searching and haven't found any with a command-line utility that works like Precise: plug in data, get a model out.
A number of the other end-to-end TTS models do have code, albeit some of it was written by people reading the paper and trying to reproduce it, rather than released with the paper. Papers with Code is your friend here, along with searching GitHub (be wary of forks with no updates). Coqui, though, does seem to be making the most forward progress of any of them.
As for simplicity, Precise is about as easy as it gets.
Quality depends on a number of factors, not just the number of recordings or hours of audio.
For the Thorsten-DE voice, community member @Thorsten recorded 23 hours of audio.
Voice actors are probably the wrong tree to bark up. A voice actor doesn't want to create a synthetic voice that sounds like them - especially not under a permissive license! Their voice is their trade.
More than once, I’ve seen a forum post about Moz/Coqui where a voice actor replies, “Why would anyone do this?!” not realizing that the call for help is directed at regular people, speaking normally, not at voice actors enunciating.
Now, if you wanted to create a specific voice, you'd obviously need someone who could make those sounds. But is that the same goal? You could make the trade worthwhile if you paid the voice actor a whole bunch of money, including royalties, for a voice pack you then sell on. But that's a product, not a hobby, and certainly not a community service.
tl;dr: forget about voice actors; it's the wrong solution.
I dunno, the actor I found on Fiverr didn’t seem to care what I was going to use it for, even after I explicitly told her I was interested in creating a virtual assistant voice. To her, it’s just words that she licenses for commercial use. Which, as far as I can tell, is about all the licensing you’d need…there’s no ongoing licensing agreement unless they ask for it after having been informed what you’re using it for. I very much doubt the voice actor who did Siri gets a kickback every time I summon her on my phone.
But I see what you're saying. I'm more interested in creating a specific kind of voice. It doesn't need to be a clone of an actor, though the uber-nerd in me certainly wouldn't mind. As long as it's got the sort of neutral intonation I'm looking for, I'll be alright. I gravitated to voice actors because: 1) I can choose the voice, and 2) saying things into microphones is literally what they sell.
But, if a VA is the wrong place to look…to whom else does one turn?
You probably want to look for a voice-over/dubbing speaker. I'd guess most "voice artists" can offer a neutral style as well.
Surely there must be someone out there who has the right combination of vocal factors and has had a dream since childhood to be turned into the voice of an open-source digital assistant project.
And I will find them!
So you need a person like me (with more vocal talent) who speaks your language (or a talented female version of me).
Since you mentioned the Harvard sentences, I assume you want to use your assistant in English?
What about LJSpeech dataset?
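For reference, LJ Speech is about 24 hours of a single female speaker (roughly 13,100 short clips at 22,050 Hz), and its `metadata.csv` is pipe-delimited with three fields per row: clip ID, raw transcription, normalized transcription. A minimal sketch of reading that layout (the sample row below is a placeholder I made up, not real dataset content):

```python
import csv
import io

# Placeholder row in LJ Speech's metadata.csv layout: id|raw|normalized.
sample = "LJ001-0001|Printing costs 1,000 pounds|printing costs one thousand pounds"

# QUOTE_NONE matters: transcriptions can contain quote characters that the
# default csv quoting rules would otherwise mangle.
reader = csv.reader(io.StringIO(sample), delimiter="|", quoting=csv.QUOTE_NONE)

# Each entry pairs a wavs/<id>.wav file with its normalized text.
rows = [(clip_id, norm) for clip_id, raw, norm in reader]
print(rows)
```

In practice you'd point the reader at the real `metadata.csv` with `open(...)` instead of the `io.StringIO` stand-in; most trainers want the normalized column, since numbers and abbreviations are already expanded there.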
I feel bad because we’re doing a good job derailing OP’s thread. Sorry, OP!
I’m a bit of a…nerd. There are lots of ways I could get lots of “just fine” voices. But I have sort of a nostalgia thing going for the voice of Majel Barrett-Roddenberry, who voiced the computer on Star Trek. I know…I know. There are so many more important things I could be doing with my life.
See, if I were an eccentric billionaire this would be no problem. Alas.
On a slightly different note about intent…a good digital assistant voice has some features I want: a deliberately neutral intonation - a sort of “informational” tone. I don’t know how better to describe it. Neither too high nor too low - right in the middle in terms of pitch. I think that will require training a model from scratch with the right “voice talent”, wherever I can find that.
I found what you guys were discussing really useful.
Of course not. XTTS v2 is out; it’s both open source and quite good. Use that instead of whatever this guy’s trying to shill for.