Reason for outsourcing speech interpretation?


Just my two cents…
I think the fact is we need an OSS alternative to proprietary tools. Google and Apple have a huge user base that provides them with plenty of data for acoustic models and LMs. We don’t.
I strongly believe Mycroft should partner with OSS companies like TheCorpora, which is working on an open source robot. Four years ago they made a $4000 robot, and apart from universities and a few individuals (like me, who built it from scratch and saved half the money), it didn’t succeed at all… :frowning:
Now they are about to release an RPi version of the robot, much, much cheaper, in order to reach a large user base. And they have partnered with Canonical, which now seems to be empowering the robotics world.
What I mean is, Mycroft could partner as well to provide some AI tools and grow its user base (e.g. desktop Linux users could provide LMs in their own language), and perhaps with the help of the other fellows (and Linux users around the world) we can build an alternative.


TL;DR - Yes, but maybe not how you are thinking

Here is my $0.02, both as a part of Mycroft and as an individual.

With pervasive voice control systems, privacy is a huge concern. I honestly don’t think any of the current players is actually doing anything nefarious, but the possibility of bad things happening is absolutely there. And unlike using a laptop or even cellphone, these systems are being designed with the intention of being able to hear every word you say all the time – that’s what makes them useful.

So more than probably any other technology that has been developed, the ability to verify what the system is doing is really important. That is why an Open Source platform makes the most sense. The more open the better.

But I’m a pragmatist. I understand that a usable voice control system requires very good Speech to Text (STT). If you aren’t in the 90+% accuracy range for the STT, nobody will use it. Mycroft’s early experiments in OpenSTT built on Kaldi were closer to 80% accuracy. That sounds decent, but the reality of that figure is that 1 out of every 5 times you try to use the system it’ll be wrong. It won’t take long for that system to be abandoned, no matter how dedicated you are to Open Source.
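To put rough numbers on that, here is a small sketch of how per-request accuracy compounds over a short session. It assumes every request succeeds or fails independently of the others, which is a simplification of my own and not something measured on Mycroft; the function name is made up for illustration.

```python
def chance_all_correct(accuracy: float, commands: int) -> float:
    """Probability that `commands` requests in a row are all recognized.

    Assumption: each request succeeds independently with probability
    `accuracy` (a simplification, real errors tend to cluster).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if commands < 0:
        raise ValueError("commands must not be negative")
    return accuracy ** commands


# Compare the accuracy levels discussed above over five requests in a row.
for accuracy in (0.80, 0.90, 0.95):
    chance = chance_all_correct(accuracy, 5)
    print(f"{accuracy:.0%} accurate: {chance:.0%} chance that 5 requests all work")
```

Under that assumption, an 80% engine gets through five requests without a mistake only about a third of the time, while a 90% engine manages it a bit more than half the time.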

So we need better Open Source STT, right? But you can’t just make that materialize. You need lots and lots of voice samples in different noise environments and with different pitches and accents to train on. So we have a classic Catch-22 – until the OpenSTT is better, nobody will want to use it. But you need people to use it to be able to gather data to make it better.

So in the short term, we are being pragmatic and using what is undeniably a very good STT engine at Google. But we are protecting privacy by breaking the connection between your Google identity and your speech to text commands/queries. We do this by being a proxy – all the Google system knows is that some Mycroft device is requesting an STT translation, but it doesn’t know which Mycroft device or (most importantly) which user. To me as an individual, that is pretty reasonable privacy.
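As a rough illustration of the proxy idea (not the actual Mycroft backend code), the sketch below copies only the fields an STT engine needs and drops everything that identifies the device or account. The class names, field names, and the `anonymize` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRequest:
    """What a hypothetical Mycroft device sends to the Mycroft backend."""
    device_id: str
    account_email: str
    language: str
    audio: bytes


@dataclass(frozen=True)
class UpstreamRequest:
    """What the proxy forwards to the third-party STT service."""
    language: str
    audio: bytes


def anonymize(request: DeviceRequest) -> UpstreamRequest:
    """Build the upstream request from an allow-list of fields.

    Nothing is copied by default, so a new identifying field added to
    DeviceRequest later cannot leak upstream by accident.
    """
    return UpstreamRequest(language=request.language, audio=request.audio)


incoming = DeviceRequest("device-123", "user@example.com", "en-US", b"\x00\x01")
print(anonymize(incoming))
```

Note that this only removes account and device metadata. The audio itself is passed through untouched, so whatever can be inferred from the voice alone is still visible to the upstream service.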

Then we can begin capturing (with user permission and anonymizing it) the voices and results of the Google STT to gather those mounds of voice data needed to train OpenSTT to make it better. THAT is the turning point in OpenSTT which breaks the Catch-22.

So yes, we are working on OpenSTT although much of it is the indirect effort of creating the environment and mechanisms we need to collect the data so we can build it.


P.S. We also have agreements with Canonical to bring Mycroft into Ubuntu. That will bring in more users, which will bring in more data, which will speed up the whole process of data collection massively.


User permission aside, does whatever agreement you have with Google permit this kind of reverse engineering?


There is a little bit of a gray area in the terms of service, but from my reading of them we would not be in violation of the terms by what I described. Worst case, we can absolutely store the voice data we gather before it goes to them for STT (with the user’s permission, of course). We’d just have to use human effort to transcribe that voice for later training purposes. There are a bunch of schemes we could use to simplify this, too. For example we could have a Skill which occasionally asks people to repeat a phrase we give them. I’m honestly not too worried about that piece.


This gets me thinking about recaptcha.

How about a new plugin that has people read text? STT can convert and match. It will fail 80% of the time, but that already happens when I try to type in the text :). One nice aspect is that the text can be localized, so getting samples of many languages is possible. Also the plugin can be distributed widely.
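The "convert and match" step could be sketched like this: compare the prompt the user was asked to read with whatever the STT engine returned, and keep the recording only if the word error rate is low enough. The function names and the 0.2 threshold are placeholders of mine, not part of any existing plugin.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation (keeping apostrophes), split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(prompt: str, transcript: str) -> float:
    """Word-level edit distance between prompt and transcript, per prompt word."""
    ref, hyp = normalize(prompt), normalize(transcript)
    if not ref:
        raise ValueError("prompt must contain at least one word")
    # Standard Levenshtein dynamic programme, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


def accept_sample(prompt: str, transcript: str, max_wer: float = 0.2) -> bool:
    """Treat the recording as a match if the error rate is within max_wer."""
    return word_error_rate(prompt, transcript) <= max_wer


print(accept_sample("Turn on the kitchen light", "turn on the kitchen light."))
print(accept_sample("Turn on the kitchen light", "what time is it"))
```

One design caveat: recordings that match are the ones the engine already handles, so the rejected ones are probably the more valuable training data and may be worth keeping for human review instead of discarding.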


I mostly agree with your post, except the portion about acceptable privacy. Google does know exactly whose Mycroft device the audio belongs to because, just like image identification, voice identification has been built into these large corporate systems. It doesn’t matter if everything goes through a proxy and they don’t know it was a Mycroft device; they can identify and catalog a user’s home audio clips just like they do images and other metadata.

Here is a possible solution to improving accuracy of Kaldi


Ok, I see your point. And using all the Mycrofts out there as proxies makes me more confident about using the service.
I’ve just opted in for Open Dataset, which I understand is the method you’re explaining here. :slight_smile:

Now I must figure out how to use Google STT instead of Mycroft’s own.


I’m not a Mycroft user or an ML guy, but just a passer by :slight_smile: FWIW, Mozilla has recently released a couple hundred hours of English STT based on voices in many accents, along with a model they claim has 6.5% error on a LibriSpeech data set. You can see the announcement post here.

They’re still growing the # of hours and plan to add more languages soon.


@jrbauer97, you are talking about DeepSpeech and the Common Voice project. They are tightly related but different things, and we (Mycroft) are already working with them to build this out.

DeepSpeech is the code that can do STT using machine learning approaches. But, like all machine learning systems, it is really useless without data to ‘learn’ from.

Common Voice is an attempt by Mozilla to build a body of recordings they can use to train on. It is a great idea and we encourage you to help with this effort. It is building a fully open (CC0 aka public domain) dataset. But this is just a dataset, it doesn’t ‘do’ anything.

Finally there is the model that Mozilla has generated by training DeepSpeech. That is what you referred to with the 6.5% figure. Those models are based on data from Common Voice plus several other datasets (e.g. the Libre Voice dataset, some TED talks, etc).

Mycroft fits in to this in two ways:

  1. We are building another dataset for them to train off of. Our users can participate in this by using the Opt In mechanism. We are building a custom license that is different from any existing license, allowing you to back out of sharing your data. This is different from the Common Voice dataset because once you release something to the public domain, you don’t get to withdraw it.
  2. As DeepSpeech and the trained model we jointly build mature, we are going to integrate them into the Mycroft ecosystem in two ways:
    a) DeepSpeech will be the engine we use on our servers.
    b) We will provide tools allowing for a simple setup of DeepSpeech on your own device or private server if you want to host your own STT service and have the computational power available to support it.

The data DeepSpeech is training on right now is decent, but it is holding the STT performance back because it is too clean. Most of it was recorded by someone at a desk using a good mic in a quiet environment. For DeepSpeech to really shine it needs data of people talking in kitchens with a stove fan running, coffee shops with lots of chatter in the background, cars with road noise, etc. And it needs lots of that data. That is where we – Mycroft AI and our community – have the unique ability to collaborate and really make this a successful joint effort.

I hope this all makes sense!


@steve.penrod Thanks for the response! Glad to hear you are working together with Mozilla on this, when I heard about you guys after reading Mozilla’s initial post about Common Voice/DeepSpeech I thought it sounded like it would be a cool use-case. Interesting to know that the Common Voice dataset is too clean for your needs (but makes sense). I recorded a few samples, but maybe I’ll add some more and turn on a few sources of white noise to dirty it a bit :wink:


I’m an avid user of Voice Attack and Voicebot. Both work great, based on the Microsoft Speech Recognition API (which you need to train). No cloud involved at all.

That said, I’m also here to provide my 50 cents on the claim that you have to be in the 90%+ ballpark to see any usage. In my opinion you’re a little wrong about that: it is a losing battle, and it becomes an excuse to always use a cloud-based STT like Amazon, Google and so on.
What you need is a way to handle inputs below a certain wanted confidence threshold (would be great if the user could set it).
Why not let Mycroft respond: I didn’t understand you perfectly - did you mean “…”? and use that info to train the system.
Personally I’m totally fine with confidence levels of around 80%. It works absolutely great for me: with just 20 minutes of training it recognizes what I’m saying with a confidence of mostly 80-93%. Keep in mind that I’m not even a native speaker. Native speakers would see an even better experience, I’m sure.
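A minimal sketch of that idea, assuming the STT engine returns a confidence score between 0 and 1 alongside the text. The `Recognition` class, the function names, and the default threshold are all made up for illustration; this is not how Mycroft currently handles results.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Recognition:
    text: str
    confidence: float  # assumed range: 0.0 (no idea) to 1.0 (certain)


def respond(result: Recognition, threshold: float = 0.8) -> Tuple[str, str]:
    """Run the command if confident enough, otherwise ask the user to confirm."""
    if result.confidence >= threshold:
        return ("execute", result.text)
    question = f'I didn\'t understand you perfectly - did you mean "{result.text}"?'
    return ("confirm", question)


def record_confirmation(result: Recognition, confirmed: bool,
                        training_log: List[Dict[str, object]]) -> None:
    """Keep the user's yes/no answer as a labelled example for later training."""
    training_log.append({
        "text": result.text,
        "confidence": result.confidence,
        "correct": confirmed,
    })


log: List[Dict[str, object]] = []
unsure = Recognition("play some jazz", 0.62)
print(respond(unsure))
record_confirmation(unsure, confirmed=True, training_log=log)
print(log)
```

Making `threshold` a user setting would let each person trade off fewer confirmation questions against more wrong actions.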

Also, about that proxy thing for Google STT:
You’re talking about how that will break the connection. Is that guaranteed?
You mention it’s not user based, but what about device identifiers?
How will you cover the usage costs? From what I’ve read it costs a lot.
How can you be sure Google STT can’t identify users by usage pattern, voice and context?

I’m genuinely interested in buying a Mark II device (because of the array), but all this talk about using Google/Amazon or other cloud STT providers makes me question it in terms of protecting my privacy.


I agree with Michael: a closed source back end makes the entire project effectively closed source. Saying “if a user works hard they could x” is just the same as saying they could write everything themselves.

Remember, these ‘AIs’ have a huge trust hurdle to cross. If you’re shipping my 5-year-old’s speech out somewhere live, or my bedroom conversation is leaving my house, YOU have an issue… Trust is not given, it’s earned. Open source earns that trust; ANY closed/hidden bit blows trust away.


We understand.

Mycroft is modular, so you’re able to swap out the default for any Speech to Text engine you prefer. For instance, you could build one locally and then swap that in.

We’re currently in talks with Mozilla to integrate their DeepSpeech product, which we’re pretty excited about.

Kind regards,


Please provide instructions for how to swap out modular pieces, I can’t find how to do that.

More specifically, I have explicit, separate versions of Sphinx/PocketSphinx as well as DeepSpeech running locally and would like to use/try them instead of the server system.



Most of the modular configuration is done through your account. At the moment, we provide configuration options for STT and TTS, not for Wake Word (ie PocketSphinx). We don’t yet have a DeepSpeech option available to select.

If you want to do more granular configuration, you would need to do this in your mycroft.conf file:

Kind regards,


Thanks Kathy!

I am mainly concerned with bringing services back to my own servers. Mycroft speech recognition seems to work quite well, but if I understand correctly, it is not available to run locally. Hence the discussion about the level of “open source”-ness.

From the descriptions you sent, I can’t see how to do the cloud vs. local switching for particular parts.

Actually, my favorite option would be to just run all of mycroft on my own server.


@neon The Mycroft speech recognition is currently a proxy to Google’s speech recognition API and therefore can’t be run locally. However, one of the available options for STT is Kaldi, which is open source and can be run locally.
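For anyone wondering roughly what pointing the STT at a local server might look like, here is a hedged sketch that builds such a config fragment. The key names `stt`, `module` and `uri`, the module name `kaldi`, and the localhost URL are assumptions on my part, so check them against the documentation for the mycroft-core version you are running before copying anything into mycroft.conf. The helper function is purely illustrative.

```python
import json


def with_local_stt(config: dict, module: str, uri: str) -> dict:
    """Return a copy of `config` whose STT section points at a self-hosted server.

    Assumed layout (verify against your mycroft-core version):
      {"stt": {"module": "<name>", "<name>": {"uri": "<server url>"}}}
    """
    updated = dict(config)
    stt = dict(updated.get("stt", {}))
    stt["module"] = module
    stt[module] = {"uri": uri}
    updated["stt"] = stt
    return updated


# Placeholder settings: an existing user config plus a Kaldi server on localhost.
user_config = {"lang": "en-us"}
local = with_local_stt(
    user_config,
    module="kaldi",
    uri="http://localhost:8080/client/dynamic/recognize",  # hypothetical URL
)
print(json.dumps(local, indent=2))
```

The function returns a new dictionary and leaves the original untouched, so you can inspect the result before writing it anywhere.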

It looks like mycroft-core doesn’t support PocketSphinx for STT right now, but there’s a PR open to add it:

(And it sounds like DeepSpeech support is coming.)


DeepSpeech support is coming :slight_smile:


it is here