Reason for outsourcing speech interpretation?

Ok, I see your point. And using all the Mycrofts out there as proxies makes me more confident about using the service.
I’ve just opted in to the Open Dataset, which I understand is the mechanism you’re describing here. :slight_smile:

Now I must figure out how to use Google STT instead of Mycroft’s.

I’m not a Mycroft user or an ML guy, just a passer-by :slight_smile: FWIW, Mozilla has recently released a couple hundred hours of English speech recordings covering many accents, along with a model they claim achieves a 6.5% word error rate on a LibriSpeech test set. You can see the announcement post here.

They’re still growing the # of hours and plan to add more languages soon.

@jrbauer97, you are talking about DeepSpeech (https://github.com/mozilla/DeepSpeech) and the Common Voice project. They are tightly related but different things, and we (Mycroft) are already working with Mozilla to build this out.

DeepSpeech is the code that can do STT using machine learning approaches. But, like all machine learning systems, it is really useless without data to ‘learn’ from.

Common Voice is an attempt by Mozilla to build a body of recordings they can use to train on. It is a great idea and we encourage you to go to https://voice.mozilla.org and help with this effort. It is building a fully open (CC0 aka public domain) dataset. But this is just a dataset, it doesn’t ‘do’ anything.

Finally, there is the model that Mozilla has generated by training DeepSpeech. That is what you were referring to with the 6.5% figure. Those models are based on data from Common Voice plus several other datasets (e.g. LibriSpeech, TED talks, etc.).

Mycroft fits into this in two ways:

  1. We are building another dataset for them to train off of. Our users can participate in this by using the Opt In mechanism. We are building a custom license that is different from any existing license, allowing you to back out of sharing your data. This is different from the Common Voice dataset because once you release something to the public domain, you don’t get to withdraw it.
  2. As DeepSpeech and the trained model we jointly build matures, we are going to integrate it into the Mycroft ecosystem in two ways:
    a) DeepSpeech will be the engine we use on our servers.
    b) We will provide tools allowing for a simple setup of DeepSpeech on your own device or private server if you want to host your own STT service and have the computational power available to support it.

The data DeepSpeech is training on right now is decent, but it is holding the STT performance back because it is too clean. Most of it was recorded by someone at a desk using a good mic in a quiet environment. For DeepSpeech to really shine it needs data of people talking in kitchens with a stove fan running, coffee shops with lots of chatter in the background, cars with road noise, etc. And it needs lots of that data. That is where we – Mycroft AI and our community – have the unique ability to collaborate and really make this a successful joint effort.

I hope this all makes sense!

@steve.penrod Thanks for the response! Glad to hear you are working together with Mozilla on this. When I heard about you guys after reading Mozilla’s initial post about Common Voice/DeepSpeech, I thought it sounded like a cool use case. Interesting to know that the Common Voice dataset is too clean for your needs (but that makes sense). I recorded a few samples, but maybe I’ll add some more and turn on a few sources of white noise to dirty it up a bit :wink:

I’m an avid user of Voice Attack and Voicebot. Both work great, based on the Microsoft Speech Recognition API (which you need to train). No cloud involved at all.

That said, I’m also here to offer my 50 cents on the statement that you have to be in the 90%+ accuracy ballpark to even see usage. In my opinion you’re a little wrong about that - chasing it is a losing battle, and it becomes a gate that pushes you toward always using a cloud-based STT like Amazon, Google and so on.
What you need is a way to handle inputs that fall below a desired confidence threshold (it would be great if the user could set it).
Why not let Mycroft respond “I didn’t understand you perfectly - did you mean ‘…’?” and use that feedback to train the system?
Personally I’m totally fine with confidence levels of around 80%. It works absolutely great for me - with just 20 minutes of training it recognizes what I’m saying with a confidence of 80-93% most of the time. Keep in mind that I’m not even a native speaker. Native speakers would see an even better experience, I’m sure.

Also, about that proxy arrangement for Google STT:
You’re talking about how that will break the connection to the user. Is that guaranteed?
You mention it’s not user-based - but what about device identifiers?
And how will you cover the usage costs? From what I’ve read it costs a lot: https://cloud.google.com/speech/
How can you be sure Google STT can’t identify users by usage patterns, voice and context?

I’m genuinely interested in buying a Mark II device (because of the mic array), but all this talk about using Google/Amazon or other cloud STT providers makes me question it in terms of protecting my privacy.

I agree with Michael: a closed-source back end makes the entire project effectively closed source. Saying “if a user works hard they could do X” is just the same as saying they could write everything themselves.

Remember, these ‘AIs’ have a huge trust hurdle to cross. If you’re shipping my 5-year-old’s speech out somewhere live, or my bedroom conversation is leaving my house, YOU have an issue… Trust is not given, it’s earned. Open source earns that trust; ANY closed/hidden bit blows it away.

We understand.

Mycroft is modular, so you’re able to swap out the default for any Speech to Text engine you prefer; for instance, you could build one locally and then swap that in.

We’re currently in talks with Mozilla to integrate their DeepSpeech product, which we’re pretty excited about.

Kind regards,
Kathy

Please provide instructions for how to swap out the modular pieces; I can’t find how to do that.

More specifically, I have explicit, separate versions of Sphinx/PocketSphinx as well as DeepSpeech running locally and would like to use/try them instead of the server system.

Sure.

Most of the modular configuration is done through your home.mycroft.ai account. At the moment, we provide configuration options for STT and TTS, but not for the Wake Word engine (i.e. PocketSphinx). We don’t yet have a DeepSpeech option available to select.

If you want to do more granular configuration, you would need to do this in your mycroft.conf file.
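
To make that concrete, the relevant piece of mycroft.conf is the stt block: "module" selects the backend, and a sub-section named after the module holds its settings. Here is a minimal sketch for switching to Google STT, assuming the key names used by the default mycroft-core config of that era (double-check them against the mycroft.conf shipped with your install); the token value is a placeholder:

```
{
  "stt": {
    // Pick the STT backend; "mycroft" is the default hosted proxy.
    "module": "google",
    // Per-module settings live under a key matching the module name.
    "google": {
      "credential": {
        // Placeholder - supply your own Google Speech API key here.
        "token": "YOUR_GOOGLE_SPEECH_API_KEY"
      }
    }
  }
}
```

(The shipped mycroft.conf uses //-style comments; strip them if your setup expects strict JSON.)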

Kind regards,
Kathy

Thanks Kathy!

I am mainly concerned with bringing services back to my own servers. Mycroft speech recognition seems to work quite well, but if I understand correctly, it is not available to run locally. Hence the discussion about the level of “open source”-ness.

From the descriptions you sent, I can’t see how to do the cloud vs. local switching for particular parts.

Actually, my favorite option would be to just run all of mycroft on my own server.

@neon The Mycroft speech recognition is currently a proxy to Google’s speech recognition API and therefore can’t be run locally. However, one of the available options for STT is Kaldi, which is open source and can be run locally.
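
As a rough sketch of what pointing Mycroft at a local Kaldi setup could look like: the Kaldi STT option sends the recorded audio to an HTTP endpoint, typically a kaldi-gstreamer-server instance you host yourself. The URI below is only a placeholder for wherever you run that server, and the exact key names are an assumption to verify against your own mycroft.conf:

```
{
  "stt": {
    "module": "kaldi",
    "kaldi": {
      // Placeholder URI of a self-hosted kaldi-gstreamer-server
      // HTTP endpoint; adjust host/port to your own deployment.
      "uri": "http://localhost:8888/client/dynamic/recognize"
    }
  }
}
```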

It looks like mycroft-core doesn’t support PocketSphinx for STT right now, but there’s a PR open to add it: https://github.com/MycroftAI/mycroft-core/pull/1225

(And it sounds like DeepSpeech support is coming.)

DeepSpeech support is coming :slight_smile:

It is here: https://github.com/MycroftAI/mycroft-core/pull/1370
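
Based on that PR, self-hosted DeepSpeech would presumably follow the same pattern as the other STT modules: point the stt block at a deepspeech-server instance you run yourself. A hypothetical sketch (module name and "uri" key inferred from the PR’s approach; the endpoint shown is only a placeholder):

```
{
  "stt": {
    "module": "deepspeech_server",
    "deepspeech_server": {
      // Placeholder URI of a locally running deepspeech-server
      // instance; match it to however you deploy the server.
      "uri": "http://localhost:8080/stt"
    }
  }
}
```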