Alternative TTS engine?

Hi, totally new to Mycroft here. I’m wondering if APIs exist to add on a new TTS engine… I would love to see if it would be possible to integrate WellSaid Labs… the absolute best TTS I’ve ever heard. Ever. Here’s a sample of what you all are missing :wink: https://drive.google.com/file/d/1cjYRrU05bIGVRc8p3e6-Pf-Ck2CNhukC/view?usp=drivesdk

Kinda pricey:


…and everything that should lead to useful info goes directly to “contact sales”. They’re very corporate-focused. Their blog seems to indicate they’re mainly about larger, higher-latency products rather than a pure TTS engine.

Interesting that none of the samples they provide have quiet backgrounds, which makes it much harder to properly evaluate how good they sound.

Can’t find anything on an API for them, or on latency. There’s nothing on GitHub for API usage currently, which is rather worrisome.

To answer your main query, anything is possible if you can program it to interface with the messagebus. Look at the Mimic2 TTS module: it basically handles sending text to an endpoint and expecting a WAV file back. Or look at the Google TTS one, which does auth as well.
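
For a rough idea of the shape, here’s a minimal sketch modeled on how those modules subclass `mycroft.tts.TTS`. Everything WellSaid-specific in it is a guess on my part, since their API isn’t publicly documented: the endpoint URL, the JSON payload, and the bearer-token auth are all placeholders.

```python
# Hypothetical WellSaid TTS engine for Mycroft, following the Mimic2/Google
# pattern: send text to an HTTP endpoint, write the returned audio to a file.
# The URL, payload, and auth header are placeholders -- WellSaid's API is not
# public, so these would need adjusting against real docs.
import requests

from mycroft.tts import TTS, TTSValidator


class WellSaidTTS(TTS):
    def __init__(self, lang, config):
        super(WellSaidTTS, self).__init__(
            lang, config, WellSaidTTSValidator(self), 'wav')
        self.url = config.get('url', 'https://api.wellsaidlabs.com/tts')
        self.api_key = config.get('api_key')

    def get_tts(self, sentence, wav_file):
        # Mycroft hands us the text to speak and a path to write audio to.
        response = requests.post(
            self.url,
            headers={'Authorization': 'Bearer {}'.format(self.api_key)},
            json={'text': sentence},
            timeout=30)
        response.raise_for_status()
        with open(wav_file, 'wb') as f:
            f.write(response.content)
        # Second element is optional phoneme data (used for mouth visemes).
        return wav_file, None


class WellSaidTTSValidator(TTSValidator):
    def validate_lang(self):
        pass  # assuming English-only for now

    def validate_connection(self):
        pass  # could ping the endpoint here once one is known

    def get_tts_class(self):
        return WellSaidTTS
```

Wiring it up after that is mostly a matter of pointing the `tts` section of mycroft.conf at the new module.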

Looks like they do some kind of Tacotron 2 + WaveNet stuff. CTO Petrochuk: “We took research like Tacotron and pushed it even further…” His GitHub has a bunch of TTS projects starred.

Hey @baconator, I’m definitely curious about what’s under the hood. I’ve got a subscription and I’ve been using them to generate PBX voice prompts for our customers. You really can’t tell that it’s not a human, especially on the other end of an 8 kHz call.

As for the API portion, you need to request access, but they do have one. They basically want to verify that you’re not reselling their service as something identical.

I did notice some interesting things while working with the files they generate: identical phrases, for example “press 1 for so-and-so, press 2 for so-and-so”, come out with slight differences in intonation and pace (I’m no speech expert here, so forgive my basic terminology). This really makes it sound very human-like. It’s the best I’ve found so far. I looked into Tacotron/WaveNet, and yep, some of those examples are almost on the level of WSLTTS, but not quite; it seems like WellSaid has polished that technology. I can get you any samples you’d like from WellSaid (I have access to 4 voices with my low-end plan, but they have many more you can unlock if you pay more).

I’m curious as to why mainstream TTS from Google, Amazon, etc. still doesn’t compete with the quality of WSLTTS. Here are some great side-by-sides:


@baconator I’ve thrown your initial reply into WSLTTS and generated output without any fixes to the pronunciation. A few things would need to be tweaked, but that’s easy to do. I wanted you to hear the untouched output though: https://drive.google.com/file/d/1BtgDkGesHbjcpQXMPisnqu3NMfVcpoNl/view?usp=sharing

Sounds very much like a Tacotron-backed engine plus some vocoder. Note the skips in acronyms, for instance.

How’s latency on the API end?

Check out Google Duplex and you’ll see how close they are.

Their voice actor demo is actually just LJ Speech: https://keithito.com/LJ-Speech-Dataset/

Oh, they’ve also done one of those marketing promos on that first video.
https://cloud.google.com/text-to-speech/docs/wavenet

I don’t think Google WaveNet sounds anywhere near as lifelike as WSLTTS. I’ve actually got a Google Assistant device in the kitchen (don’t shoot me, I haven’t had a chance to dabble with Mycroft yet) and the rest of my family uses it regularly. I can say without a doubt that although very good, its voice isn’t as lifelike as WSLTTS.

Now Duplex: I saw that demo and it slipped my mind! Yeah, that is just unbelievable. Having that with a voice assistant like Mycroft would be just nuts! Any idea how they got to that point?

I appreciate your insight. I have an interest in deep learning but just don’t have the time right now to immerse myself in it. When I eventually do have some time, I’d like to have the lay of the land already. I also like to know what the current state of the art (or at least the most popular approach at the time) is in each of the common areas, such as YOLO for object recognition. By the way, pyimagesearch.com is an awesome resource for hands-on learning of computer vision and deep learning topics.

As for latency on the WSLTTS API, I can’t speak to it since I haven’t obtained access yet. I can say that using the web interface to generate files takes a bit of time: the snippet I last posted for you took around 15 seconds to prepare the file download. But I’m not sure how much of that can be chalked up to whatever queuing system they have in place to generate files. I’m going to request API access and I’ll let you know how that goes.
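
Once access comes through, a quick check like the sketch below is how I’d plan to measure it. The endpoint and payload are placeholders (the real API is undocumented); requests’ `elapsed` attribute gives time-to-headers, which should help separate server-side generation/queueing delay from download time:

```python
# Rough latency probe for a TTS HTTP API. URL and payload are placeholders
# until the real WellSaid API docs are available.
import time

import requests

URL = 'https://api.example.com/tts'  # hypothetical endpoint
PAYLOAD = {'text': 'Press one for sales, press two for support.'}

start = time.perf_counter()
response = requests.post(URL, json=PAYLOAD, timeout=60)
total = time.perf_counter() - start

# elapsed = time from sending the request to receiving response headers,
# i.e. roughly the server-side generation/queueing delay.
print('Time to first response: {:.2f}s'.format(response.elapsed.total_seconds()))
print('Total round trip:       {:.2f}s'.format(total))
print('Audio bytes received:   {}'.format(len(response.content)))
```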

On my computer speakers it sounds like the better demos of Nancy and LJ Speech you find when looking into Mozilla TTS, Tacotron 2, WaveNet, and WaveGlow. Maybe WSLTTS does sound a bit better when listening on really good loudspeakers or headphones, but that would not be Mycroft’s use case…

If you like that, you might want to look into Nvidia’s Flowtron; it has some very interesting parameters for variation and style.
