Uploading my own voice profile for the assistant?

Hello community,

I'm happy to have come across the Mycroft project. I would like to know how Mycroft updates/uploads each voice profile, because I would like to use my own voice as the voice profile in Mycroft. Any discussions/links/studies/documents on this would be useful. I would also like to know what format and what specific information about the voice dataset is needed to upload a profile.

Thank you in advance :white_check_mark:

Can you clarify which part of the tech stack you mean?

If you’d like to replace Mycroft’s voice with one based on your own, this would require a third-party text to speech system. There are several, and they are not hard to plug in, but getting a large enough corpus to generate a good voice is a tremendous undertaking. The no-longer-Mozilla one uses a single, public corpus to generate only a handful of voices. Untold thousands of samples, of many different people, repeatedly blended with ML… lots of work.

If you’d like to tune Mycroft to understand your voice better, again, making it use just your voice would be a huge undertaking and probably impossible. However, you could consider contributing samples to the no-longer-Mozilla speech to text system, which is quite functional. This will make your samples part of the dataset that system uses and refines to interpret what you say.

If you’d like to teach Mycroft to recognize your voice, as of today, I don’t think that exists. Mycroft itself doesn’t have the feature, and the only community plugin I know of that ever tried never got anywhere (afaik) and has been abandoned for ages.

Hello @ChanceNCounter

Thanks for your reply. To be clear, I would like to change/replace Mycroft’s voice with a voice of my choosing.

My ultimate aim is to build a feature that can clone any voice in the world and use it as that person’s Mycroft voice.

Using the repo given below, we can clone a voice from an original recording and then convert any text to speech with it.

I think it would be possible to send Mycroft’s replies (text) to the cloning software and have it convert those replies from text to speech, but that would add speech-synthesis time, processing time, etc., and it also needs a good GPU, so it wouldn’t be feasible.

So I wanted to understand how Mycroft loads its default voice data into the core Mycroft AI program. Then I could export the cloned voice data and import it into Mycroft core, which might cut out all of that processing time.

In short, I want to export the cloned voice data and import it into Mycroft core. So I need to know how Mycroft loads its voices: the file type and all related details.

You can check this repo GitHub - CorentinJ/Real-Time-Voice-Cloning: Clone a voice in 5 seconds to generate arbitrary speech in real-time

I had a look at that GitHub and the work done there is pretty impressive. In layman’s terms, Mycroft takes the text of whatever it is going to speak and passes that text to a TTS engine (Mimic 1/2, Mary, Google, etc.). Once the text is passed to the configured TTS engine, the engine returns an audio file (WAV) of the converted text, which Mycroft plays to the listener. If the repo you reference has an API you can pass text to and get a WAV file back from, then I guess in theory it could work. More information on Mycroft TTS configuration here:
https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customizations/tts-engine
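
To make that concrete, a rough (untested) sketch of such a glue layer might look like the following. It assumes the mycroft.tts base class from mycroft-core (get_tts(sentence, wav_file) returning a (wav_file, phonemes) tuple) and a purely hypothetical HTTP endpoint sitting in front of the cloning software:

```python
# Rough sketch only -- assumes the mycroft-core TTS base class and a
# hypothetical HTTP endpoint in front of the voice-cloning software.
import requests

from mycroft.tts import TTS, TTSValidator

CLONE_TTS_URL = "http://localhost:5002/api/tts"  # hypothetical service URL


class ClonedVoiceTTS(TTS):
    def __init__(self, lang, config):
        super().__init__(lang, config, ClonedVoiceValidator(self), audio_ext="wav")

    def get_tts(self, sentence, wav_file):
        # Ask the external service to synthesize the sentence...
        response = requests.get(CLONE_TTS_URL, params={"text": sentence}, timeout=60)
        response.raise_for_status()
        # ...and write the returned WAV where Mycroft expects to find it.
        with open(wav_file, "wb") as out:
            out.write(response.content)
        # Mycroft plays the file at this path; no phoneme/viseme data here.
        return wav_file, None


class ClonedVoiceValidator(TTSValidator):
    def validate_lang(self):
        pass  # accept any language in this sketch

    def validate_connection(self):
        pass  # could ping CLONE_TTS_URL here to fail fast if the service is down

    def get_tts_class(self):
        return ClonedVoiceTTS
```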

Hello @pcwii, thank you very much for the valuable information.

Yes, we would be able to create an API in the cloning software and transfer the WAV file, but the problem here is the heavy GPU processing, so real-time use would be a mess.

I would like to create the new voice in the TTS engine itself. I’m thinking of using Mimic 1 to create the new voice, since I have read that it can also run without a network connection.

Could you give more info on Mimic 1 and how I can use it to create my own voice?
Also, which TTS engine would you recommend as the best option?

Kindly let me know your thoughts.

Mimic1 is based on Flite from CMU. I’d suggest skipping that one. Mimic2 is based on Tacotron, but it’s much more complex, and also now out of date, so skip it. Your current best bet is to check out Coqui TTS. You would need to record at least ten hours of carefully annotated voice clips in very high (studio or very near) quality to result in a good model.
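
For the annotation side, Coqui’s stock formatters can consume an LJSpeech-style layout: a wavs/ folder plus a metadata.csv whose lines look like clip_id|transcription|normalized_transcription. Something like the illustrative, stdlib-only sketch below can catch missing files, empty transcriptions, or wrong sample rates before any GPU time is spent; the paths and the 22050 Hz rate are just assumptions:

```python
# Sketch: sanity-check an LJSpeech-style dataset (wavs/ + metadata.csv with
# "clip_id|text|normalized_text" lines) before training. The dataset path and
# the 22050 Hz target rate are illustrative assumptions, not Coqui requirements.
import csv
import wave
from pathlib import Path

DATASET = Path("my_voice_dataset")   # hypothetical dataset root
TARGET_RATE = 22050                  # common rate for LJSpeech-style models

with open(DATASET / "metadata.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="|"))

problems = []
for row in rows:
    clip_id, text = row[0], row[1]
    wav_path = DATASET / "wavs" / f"{clip_id}.wav"
    if not wav_path.exists():
        problems.append(f"missing audio for {clip_id}")
        continue
    if not text.strip():
        problems.append(f"empty transcription for {clip_id}")
    with wave.open(str(wav_path), "rb") as w:
        if w.getframerate() != TARGET_RATE:
            problems.append(f"{clip_id}: sample rate {w.getframerate()} != {TARGET_RATE}")

print(f"{len(rows)} clips checked, {len(problems)} problems")
for p in problems:
    print(" -", p)
```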

Hello @baconator, thanks for the update.

I can clone my voice and produce 10 hours of voice clips; that would be fine for me.

Since many TTS engines require a good GPU to produce the voice in real time, I would like to know exactly which TTS doesn’t require a GPU and could also run on mobile hardware. Would a cloud method be good?

I would like to make my voice compatible with ordinary hardware without a GPU, even if, without a GPU, it will take several seconds.

Any suggestions? Any cloud tts? Kindly let me know

Thank you

Training takes a GPU. Inference would be doable on an x86 desktop CPU at better than real time. You could run it on a cloud system as well.

The more quality data you have the better your training will go.
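
If you want a rough feel for whether your CPU keeps up, you can time a single synthesis with Coqui’s tts command-line tool and compare it to the length of the audio it produces. The sketch below uses one of Coqui’s pretrained English models as a stand-in for your eventual custom voice; note that the first run also downloads the model, so time a second run for a fair number:

```python
# Sketch: time one Coqui TTS synthesis on CPU and compare it to the length of
# the generated audio. Assumes the "tts" CLI from the Coqui TTS package is on
# PATH; the model name is a pretrained example, not your custom voice.
import subprocess
import time
import wave

TEXT = "The quick brown fox jumps over the lazy dog."
OUT = "cpu_test.wav"

start = time.time()
subprocess.run(
    [
        "tts",
        "--text", TEXT,
        "--model_name", "tts_models/en/ljspeech/tacotron2-DDC",
        "--out_path", OUT,
    ],
    check=True,
)
elapsed = time.time() - start  # includes model loading (and download on first run)

with wave.open(OUT, "rb") as w:
    audio_seconds = w.getnframes() / w.getframerate()

# A real-time factor below 1.0 means synthesis is faster than playback.
print(f"synthesis took {elapsed:.1f}s for {audio_seconds:.1f}s of audio "
      f"(RTF ~ {elapsed / audio_seconds:.2f})")
```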

@baconator Can you give me any links to learn more about Coqui TTS training, inference, and running the new voice?

Also any links on how to connect Coqui TTS with Mycroft ?

In the Mycroft docs you can find instructions on how to connect Mycroft with Coqui TTS:
https://mycroft-ai.gitbook.io/docs/using-mycroft-ai/customizations/tts-engine#coqui-tts
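
The linked page has the authoritative key names, but the mechanism is just an override in the user config: the "tts" section of mycroft.conf selects a module and gives it module-specific settings. The sketch below patches ~/.mycroft/mycroft.conf in that shape; the "coqui" module name and its settings here are placeholders to be replaced with whatever the docs page specifies:

```python
# Sketch: add a TTS override to the user config at ~/.mycroft/mycroft.conf.
# The general "tts": {"module": ..., <module>: {...}} shape is how Mycroft
# selects a TTS engine; the "coqui" name and its keys are placeholders, so use
# the exact values from the documentation page linked above. If your existing
# config contains comments, edit it by hand instead of running this.
import json
from pathlib import Path

conf_path = Path.home() / ".mycroft" / "mycroft.conf"
config = json.loads(conf_path.read_text()) if conf_path.exists() else {}

config["tts"] = {
    "module": "coqui",                          # placeholder module name
    "coqui": {"uri": "http://localhost:5002"},  # placeholder settings
}

conf_path.parent.mkdir(parents=True, exist_ok=True)
conf_path.write_text(json.dumps(config, indent=2))
print(f"wrote TTS override to {conf_path}")
```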

You’ll need to go here: GitHub - coqui-ai/TTS: 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
There’s a lot of work involved, but once you’ve gotten the setup and data handling down, it isn’t too terrible.

getting a large enough corpus to generate a good voice is a tremendous undertaking.

You know it’s funny, on one hand we’ve got tech like this which can create lifelike video and audio deepfakes out of 5 minutes of sample data in almost realtime, meanwhile the entire text to speech field of research is somehow struggling to produce anything useful that doesn’t sound like it’s speaking out of breath and through a grate with zero intonation or take half a datacenter to run.

I guess the economic incentives are just too high to actually open source anything recent so we’re stuck with what we have.

I mean, you’re more than welcome to brush up on recent advances and then release the code derived from them.
Search “end-to-end text to speech” or similar terms. Computer Science authors/titles recent submissions

To be clear, the paper you’re referencing also starts with a large dataset to model and then, based on that, does the retargeting. [2011.10688] Iterative Text-based Editing of Talking-heads Using Neural Retargeting: “(2) We leverage a large repository of video of a source actor and develop a new self-supervised neural retargeting technique for transferring the mouth motions of the source actor to the target actor.”

Yep, and that’s probably the only practical and reasonable approach. I mean having to record hours and hours of speech every time you just want to change the final output frequencies a little seems rather excessive.

After all, then you just need an open, labeled voice dataset like LJSpeech to serve as a baseline and then retarget. Someone who actually knows what they’re doing would need to do that part and open source it, of course. I’m sure someone will come along in the next few years.

I was researching this yesterday while I had very little to do at work. At least some of the answers I found like this one from Stack Overflow seem to indicate that tens of hours of audio aren’t necessarily needed to train a model - the tutorial suggested 100 clips of high-quality audio between 1-10 seconds in length. Will it sound good? Who knows.

The Mozilla (not Mozilla?) / Coqui TTS engine website is conspicuously short on explanation of how to actually train the thing and create a usable model. I have a friend willing to donate a mediocre desktop computer to the cause, and I’m thinking of buying a CUDA GPU and training my own model as well, by paying a voice actor (yes, really) to read all of the Harvard sentences for me. Of course, I’ll try it myself before I commit to spending money, but I want a pleasing voice.

If it actually happens that 500 sentences work, and if I can figure out how to wade through the incredibly dense and difficult-to-understand documentation surrounding Coqui / Moz TTS, then I’d happily donate some time to anyone else who would like a model trained for them.

The audio preprocessing script referenced in the stackoverflow article was authored by me :wink:

Please join the Mozilla TTS forum and/or follow discussions on the Coqui-TTS repository

I will, Dominik! Thanks for the suggestion. :grinning:

That does say “at least” 100 sentences; at 10s each that’s still only 15-ish minutes. I queried some voice actors about this once, and the cheapest quote I got was $3k for 8 hours (sentences with the top 500 words + Harvard + 100 from technical papers + 200 based on queries + Simple Wikipedia summary lines, all licensed for repurposing). I would imagine you’d also want to add in the top 100[?] responses from Mycroft for good measure, as well as a number of recent technical sentences to help shape those kinds of responses. At that price point you’re into an NVIDIA Tesla card and some really good audio gear. The quality of the audio data directly affects the end model… the old GIGO.

The info in the linked post summarizes what you need to know for basic modeling. A GPU with at least 8 GB will definitely simplify your life as well if you do get one, though at current prices, good luck.

You’re not wrong. I was looking at actors on Fiverr and got a quote for about $500, but this was only for the Harvard sentences. Obviously adding more material makes the price go up. (But the actor I was thinking of does happen to sound a lot like a classic sci-fi computer voice actor, so that’s a plus.) Before I spend anything, though, I’ll definitely see what the model sounds like with just my voice. I have the ability to record quite good audio. I’m interested to see how good the models can get with only 500 sentences.

I’m very willing to trade off price against time. I don’t mind having the computer run for a few days if it keeps my costs low, ha. :stuck_out_tongue:

I gotta say, the fact that we’re guessing at the numbers we need for a good model is slightly frustrating - I wish the documentation here were a little more straightforward, particularly in terms of how much data one needs versus the clarity of the model that is produced. But hey, I suppose I can take it as an opportunity to do some basic research for the community!