Reason for outsourcing speech interpretation?

Sorry, I couldn’t find a more appropriate section that would accept a post. A question I’ve wanted to ask since I listened to the Linux Unplugged interview is the reason behind using cloud services for speech interpretation (i.e., when you give Mycroft a verbal command, it sends the audio to the cloud for conversion into an instruction that Mycroft can process). This may make some users and potential customers uncomfortable. I know Ryan has made a statement that it is something mycroft.ai hopes to address in the future, but in the meantime, is this a tech issue or a license issue? In other words, is the necessary software not available as open source, or is the RPi 2 just not powerful enough to process speech in real time? If it is the second reason, could one run the speech-to-text conversion on a more powerful local machine, like an i5 or i7, or even a private cloud server?

1 Like

@FiftyOneFifty it’s a tech issue. Pocketsphinx (which runs locally to recognize the wake word “Mycroft”) is not really all that powerful or accurate, with accuracy being the bigger shortcoming.
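For anyone curious, that local layer is essentially keyword spotting. A rough, untested sketch using the pocketsphinx-python LiveSpeech helper looks something like the following (the keyphrase and threshold here are illustrative, not our actual configuration):

```python
# Untested sketch of local wake-word spotting with pocketsphinx-python.
# The keyphrase and threshold values below are illustrative only.
from pocketsphinx import LiveSpeech

speech = LiveSpeech(
    lm=False,              # no full language model; keyword search only
    keyphrase='mycroft',   # the wake word to listen for
    kws_threshold=1e-20,   # detection sensitivity; tune per microphone and room
)

# Blocks on the default microphone and yields a result each time the keyphrase is heard.
for phrase in speech:
    print('wake word detected:', phrase)
```

The trade-off is exactly what you would expect: loosen the threshold and you get false wake-ups, tighten it and it misses you from across the room.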

However, if someone on the forums is an STT rockstar and has some good feedback in this area, that would be great. But I want the community to bear in mind that just because Mycroft does remote speech-to-text doesn’t mean that it is insecure or storing your info. We are working on a backend that is open source and respects your privacy. And, as I mentioned before, we are not opposed to community feedback, and we welcome members who want to offer their expertise in this area.

2 Likes

Hey @FiftyOneFifty! Thanks for the question. As always with these types of questions, the answer is “a little of everything.”

As I mentioned in a previous post, there are two primary styles of speech recognition. The first (and the one most commonly provided to developers) uses a pre-defined grammar, and everything that will ever be recognized must fit into that grammar. The existence of the grammar itself limits the scope of what is recognizable, makes updating the vocabulary live difficult, and the sheer size of the grammar (imagine adding every music artist so you can request any style of music from Spotify) can be problematic, even with the more relaxed space constraints of a desktop PC. There are commercial offerings in this space (most notably Nuance), but they are not free, and definitely not OSS.
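To make the grammar idea concrete, here is a minimal, untested sketch using the pocketsphinx-python Decoder API (the file names are made up); note how every artist you ever want to recognize has to be enumerated up front:

```python
# Grammar-constrained recognition sketch (pocketsphinx-python API; file names are hypothetical).
import os
from pocketsphinx import Decoder, get_model_path

# music.gram is a JSGF grammar; everything recognizable must be listed in it, e.g.:
#   #JSGF V1.0;
#   grammar music;
#   public <command> = play <artist> | stop the music;
#   <artist> = radiohead | daft punk | miles davis;   // ...and every other artist, forever

model_path = get_model_path()
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(model_path, 'en-us'))                # acoustic model
config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))  # pronunciation dictionary
config.set_string('-jsgf', 'music.gram')                                    # the grammar itself
decoder = Decoder(config)

# Feed it one utterance of 16 kHz, 16-bit mono PCM audio.
decoder.start_utt()
with open('utterance.raw', 'rb') as audio:
    while True:
        buf = audio.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp else 'no match: the utterance fell outside the grammar')
```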

The second style of speech recognition is called “dictation,” and as the name suggests it’s for transcribing general-purpose dictation. This is the type of thing you use on a daily basis on your smartphone (via Google Now or Siri). Nuance is (again) a competitor in this space, and their tech was rumored to back Siri, though I would guess that AAPL has taken a lot of that in-house. There isn’t (to my knowledge) a high-quality open-source dictation recognizer available, and it would require significant specialized experience in the community to create one. Even if we can find one, running it off-device would probably be a requirement as well.

For either of these scenarios, there are two large datasets that need to be collected: an acoustic model and a language model. The acoustic model is a standardized catalog of the sounds in a language (or language subset), and the language model is a catalog of how those sounds form words. </OverSimplifiedExplanation> The Wikipedia article on speech recognition is a good place to start learning about this stuff: https://en.wikipedia.org/wiki/Speech_recognition . Creating both kinds of models requires a large amount of data and knowledge of a specialized set of tools. Most of these tools come out of academia and are not particularly well packaged, so just getting them running can be a struggle. There are some public-domain data sets available (like an English acoustic model of a male voice reading the Wall Street Journal), but they’re few and far between.
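For a toy flavor of the language-model half (real toolkits such as the CMU-Cambridge LM toolkit or KenLM work at vastly larger scale, with smoothing and back-off), a statistical language model is essentially counts of which words tend to follow which:

```python
# Toy bigram language model: estimate how likely one word is to follow another.
# Real language models are built from enormous corpora; this is only an illustration.
from collections import Counter, defaultdict

corpus = "play some jazz please . play some blues please . stop the music".split()

bigrams = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev][word] += 1

def next_word_probability(prev, word):
    """P(word | prev), estimated from raw counts (no smoothing)."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total if total else 0.0

print(next_word_probability('play', 'some'))   # 1.0 -- "some" always follows "play" in this corpus
print(next_word_probability('some', 'jazz'))   # 0.5 -- "jazz" and "blues" split it evenly
```

The recognizer uses those probabilities to prefer word sequences that actually occur in the language, which is why you need so much text on the language side and so much transcribed audio on the acoustic side.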

So, in summation: high-quality data sets can require a lot of resources (memory and CPU), the OSS tech for STT may or may not be of sufficient quality to make Mycroft a good experience (we’re still investigating OSS solutions), and data. Always with the data!

As @ryanleesipes has mentioned, we’ll be using Pocketsphinx for local wake-word recognition (which is pretty limited in its capabilities), and then kicking off-device for dictation STT. Personally, I intend to make that latter part plug-and-play, so users can switch between Mycroft, GOOG, AMZN, or any other provider they’d like to use. The trade-off for you will be quality of recognition vs. privacy concerns, as GOOG and others have superior tech and resources, and likely will for the foreseeable future.
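To picture the plug-and-play part, here is a purely hypothetical sketch (not actual Mycroft code); each backend only has to turn audio bytes into text, and the user picks which one they trust:

```python
# Hypothetical sketch of a swappable STT backend interface (not actual Mycroft code).
from abc import ABC, abstractmethod

class STTBackend(ABC):
    @abstractmethod
    def transcribe(self, audio: bytes, language: str = 'en-US') -> str:
        """Turn one utterance of raw audio into text."""

class MycroftSTT(STTBackend):
    def transcribe(self, audio, language='en-US'):
        # Would POST the audio to Mycroft's own backend.
        raise NotImplementedError

class GoogleSTT(STTBackend):
    def transcribe(self, audio, language='en-US'):
        # Would call a Google speech endpoint instead.
        raise NotImplementedError

def get_backend(name: str) -> STTBackend:
    """Users pick their own trade-off: recognition quality vs. privacy."""
    return {'mycroft': MycroftSTT, 'google': GoogleSTT}[name]()
```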

2 Likes

So, which service does Mycroft use by default?

Hoping to use our own STT backend by default.

I assume Mycroft will be using multiple microphones, as attaining high accuracy with a single not-so-expensive microphone is very difficult. That’s where the Ubi fails.
So is all the audio processing (multi-channel noise reduction, filtering, etc.) going to be done in the cloud?
Wouldn’t that add to the data usage?
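(For rough scale, and assuming nothing about Mycroft’s actual audio format: uncompressed 16 kHz, 16-bit mono PCM is 16,000 samples × 2 bytes = 32 KB per second, so a five-second command is around 160 KB before compression, and lossless codecs like FLAC typically cut that roughly in half.)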

In other words, Mycroft is not completely open source. The backend is closed source, which to me is dishonest. You must have chosen to omit this from the Kickstarter because you would have received less funding. GG.

1 Like

I am using Dragon 14 Professional for my speech recognition and it works fantastically. I use a first-generation Microsoft Kinect as my microphone and have had excellent results. Granted, I am using this with another A.I. entity, but still…

@Michael_Speth, as I mentioned in the other post, we are using what we can while we develop an open model via the OpenSTT initiative. This is an extremely difficult task to undertake, so we are using what is available today as we pour development time and money into building an open-source alternative.

If you would like to help, we would appreciate it, but your assumption that we are seeking to swindle people is offensive to me.

2 Likes

Take a look at the Julius recognizer. When I worked with it, it seemed pretty fast, and it runs on a Raspberry Pi. The only catch is that the free dictionaries it comes with are for Japanese, which I speak well enough for testing purposes. It is open source, though.

Do you really think you need a dictation-type recognizer, though?

I’m also trying to integrate the Kinect with an A.I., but I’m struggling with the MS SDK; most examples are in C# and I’m more into Python. What programming language do you use to interact with the Kinect? Is it possible to run it on Linux? As far as I can tell, one is stuck with MS Windows.

Nevertheless, that’s what happened. People backed it on the basis of a description that isn’t valid. I used the contact form on mycroft.ai to attempt to get an answer, but haven’t received a response in the week since.

Nowhere is there a place to cancel an order; how do we do that? Until there’s a clean back end that doesn’t require a Google login, I’m not interested in buying, and I would not have bought if the Google ties had been clear from the beginning.

1 Like

Are you still working on the OpenSTT project? Because until that is done this is not an open source project…

Just my two cents…
I think the fact is we need an OSS alternative to proprietary tools. Google and Apple have huge user bases that provide them with plenty of acoustic models and LMs. We don’t.
I strongly believe Mycroft should partner with OSS companies like TheCorpora, which is working on an open-source robot. Four years ago they made a $4,000 robot, and apart from universities and a few individuals (like me, who built it from scratch and saved half the money), it didn’t really succeed at all… :frowning:
Now they are about to release an RPi version of the robot, much, much cheaper, in order to reach a larger user base. And they have partnered with Canonical, which now seems keen to empower the robotics world.
What I mean is, Mycroft could partner up as well to provide some AI tools and grow its user base (e.g., desktop Linux users could provide LMs in their own languages), and perhaps with the help of those other projects (and Linux users around the world) we can build an alternative.

TL;DR - Yes, but maybe not how you are thinking

Here is my $0.02, both as a part of Mycroft and as an individual.

With pervasive voice control systems, privacy is a huge concern. I honestly don’t think any of the current players are actually doing anything nefarious, but the possibility of bad things happening is absolutely there. And unlike using a laptop or even a cellphone, these systems are being designed with the intention of being able to hear every word you say, all the time – that’s what makes them useful.

So more than probably any other technology that has been developed, the ability to verify what the system is doing is really important. That is why an Open Source platform makes the most sense. The more open, the better.

But I’m a pragmatist. I understand that a voice control system requires very good Speech to Text (STT) to be anywhere near usable, and if you aren’t in that 90+% accuracy range, nobody will use it. Mycroft’s early experiments in OpenSTT, built on Kaldi, were more in the 80% accuracy range. That sounds decent, but the reality of that figure is that 1 out of every 5 times you try to use the system, it will get it wrong. It won’t take long for that system to be abandoned, no matter how dedicated you are to Open Source.

So we need better Open Source STT, right? But you can’t just make that materialize. You need lots and lots of voice samples, in different noise environments and with different pitches and accents, to train on. So we have a classic Catch-22 – until OpenSTT is better, nobody will want to use it, but you need people to use it to be able to gather the data that makes it better.

So in the short term, we are being pragmatic and using what is undeniably a very good STT engine from Google. But we are protecting privacy by breaking the connection between your Google identity and your speech-to-text commands/queries. We do this by acting as a proxy – all the Google system knows is that some Mycroft device is requesting an STT translation, but it doesn’t know which Mycroft device or (most importantly) which user. To me as an individual, that is pretty reasonable privacy.
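In rough pseudo-form (a hypothetical endpoint and response field, not our actual backend code), the proxy idea is simply: forward the audio, drop everything that identifies the device or the user:

```python
# Hypothetical sketch of the anonymizing proxy idea (not the actual Mycroft backend).
# It accepts audio from a device, strips anything identifying, and forwards only
# the raw audio to the upstream STT provider.
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
UPSTREAM_STT_URL = 'https://stt.example.com/v1/recognize'  # placeholder, not a real endpoint

@app.route('/stt', methods=['POST'])
def proxy_stt():
    audio = request.get_data()  # raw audio bytes from the Mycroft unit
    # Deliberately do NOT forward device IDs, account tokens, client IPs, or cookies.
    upstream = requests.post(
        UPSTREAM_STT_URL,
        data=audio,
        headers={'Content-Type': 'audio/x-flac; rate=16000'},
    )
    # 'transcript' is an assumed field name for this sketch.
    return jsonify(text=upstream.json().get('transcript', ''))
```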

Then we can begin capturing (with user permission, and anonymized) the voices and results of the Google STT to gather the mounds of voice data needed to train OpenSTT and make it better. THAT is the turning point in OpenSTT which breaks the Catch-22.

So yes, we are working on OpenSTT although much of it is the indirect effort of creating the environment and mechanisms we need to collect the data so we can build it.

3 Likes

P.S. We also have agreements with Canonical to bring Mycroft into Ubuntu, which will bring in more users, which will bring in more data, which will massively speed up the whole process of data collection.

1 Like

User permission aside, does whatever agreement you have with Google permit this kind of reverse engineering?

There is a little bit of a gray area in the terms of service, but from my reading of them we would not be in violation by doing what I described. Worst case, we can absolutely store the voice data we gather before it goes to them for STT (with the user’s permission, of course). We’d just have to use human effort to transcribe that voice for later training purposes. There are a bunch of schemes we could use to simplify this, too. For example, we could have a Skill which occasionally asks people to repeat a phrase we give them. I’m honestly not too worried about that piece.

This gets me thinking about recaptcha.

How about a new plugin that has people read text? The STT can convert it and match it against the original. It will fail 80% of the time, but that already happens when I try to type in the text :). One nice aspect is that the text can be localized, so getting samples in many languages is possible. The plugin could also be distributed widely.
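A rough, framework-agnostic sketch of the matching part (the 0.8 threshold is arbitrary, and wiring it into an actual skill or plugin is left out):

```python
# Compare what the STT heard against the prompt the user was asked to read.
from difflib import SequenceMatcher

def reading_matches(prompt: str, transcript: str, threshold: float = 0.8) -> bool:
    """Return True if the STT transcript is close enough to the prompted text."""
    similarity = SequenceMatcher(None, prompt.lower(), transcript.lower()).ratio()
    return similarity >= threshold

# If it matches, the (audio, prompt) pair becomes a labeled training sample.
print(reading_matches('the quick brown fox', 'the quick brown fox'))   # True
print(reading_matches('the quick brown fox', 'the quick round box'))   # borderline, depends on threshold
```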

I mostly agree with your post, except the portion about acceptable privacy. Google can know exactly whose Mycroft device the audio belongs to, because just like image identification, voice identification has been built into these large corporate systems. It doesn’t matter that everything goes through a proxy and they don’t know it’s a Mycroft device; they can identify and catalog a user’s home audio clips just like they do images and other metadata.

Here is a possible solution for improving the accuracy of Kaldi