Reason for outsourcing speech interpretation?

I assume Mycroft will be using multiple microphones, as attaining high accuracy with a single not-so-expensive microphone is very difficult. That’s where UBI fails.
So all the audio processing (multi-channel noise reduction, filtering, etc.) is going to be done in the cloud?
Wouldn’t that add to the data usage?

In other words, Mycroft is not completely open source. The backend is closed source, which to me is dishonest. But you must have chosen to omit this in the Kickstarter because you would have received less funding. Gg

1 Like

I am using Dragon 14 Professional for my speech recognition and it works fantastically. I use a first-generation Microsoft Kinect as my microphone and have had excellent results. Granted, I am using this with another A.I. entity, but still…

@Michael_Speth, as I mentioned in the other post, we are using what we can while we develop an open model via the OpenSTT initiative. This is an extremely difficult task to undertake, so we are using what we can as we pour development time and money into developing an open source alternative.

If you would like to help, we would appreciate it, but your assumption that we are seeking to swindle people is offensive to me.

2 Likes

Take a look at the Julius recognizer. When I worked with it, it seemed pretty fast, and it runs on a Raspberry Pi. The only catch is that the free dictionaries it comes with are for Japanese, which I speak well enough for testing purposes. It is open source, though.

Do you really think you need a dictation-type recognizer, though?

I’m also trying to integrate the Kinect with A.I., but I’m struggling with the MS SDK - most examples are in C# and I’m more into Python. What programming language do you use to interact with the Kinect? Is it possible to run it on Linux? As far as I can tell, one is stuck with MS Windows.

Nevertheless, that’s what happened. People backed it on the basis of a description that isn’t valid. I used the contact form on mycroft.ai to attempt to get an answer, but haven’t received a response in the week since.

Nowhere is there a place to cancel an order - how do we do that? Until there’s a clean back-end that doesn’t require a Google login, I’m not interested in buying - and would not have backed it if the Google ties had been clear from the beginning.

1 Like

Are you still working on the OpenSTT project? Because until that is done this is not an open source project…

Just my two cents…
I think the fact is we need an OSS alternative to proprietary tools. Google and Apple have huge user bases that provide them with plenty of acoustic models and LMs. We don’t.
I strongly believe Mycroft should partner with OSS companies like TheCorpora, which is working on an open-source robot. Four years ago they made a $4000 robot, and aside from universities and a few individuals (like me, who built it from scratch, saving half the money), it didn’t succeed at all… :frowning:
Now they are about to release an RPi version of the robot, much cheaper, in order to reach a larger user base. And they have partnered with Canonical, which now seems to be empowering the robotics world.
What I mean is, Mycroft could partner as well, to provide some AI tools and grow its user base (e.g. desktop Linux users could provide LMs in their own languages), and perhaps with the help of other projects (and Linux users around the world) we can build an alternative.

TL;DR - Yes, but maybe not how you are thinking

Here is my $0.02, both as a part of Mycroft and as an individual.

With pervasive voice control systems, privacy is a huge concern. I honestly don’t think any of the current players is actually doing anything nefarious, but the possibility of bad things happening is absolutely there. And unlike using a laptop or even cellphone, these systems are being designed with the intention of being able to hear every word you say all the time – that’s what makes them useful.

So, more than probably any other technology that has been developed, the ability to verify what the system is doing is really important. That is why an Open Source platform makes the most sense. The more open the better.

But I’m a pragmatist. I understand that a voice control system requires very good Speech to Text (STT) to be anywhere near usable. And if you aren’t at a 90+% accuracy rate for the STT, nobody will use it. Mycroft’s early experiments in OpenSTT built on Kaldi were more in the 80% accuracy range. That sounds decent, but the reality of that figure is that 1 out of every 5 times you try to use the system it’ll be wrong. It won’t take long for that system to be abandoned, no matter how dedicated you are to Open Source.

So we need better Open Source STT, right? But you can’t just make that materialize. You need lots and lots of voice samples in different noise environments and with different pitches and accents to train on. So we have a classic Catch-22: until OpenSTT is better, nobody will want to use it, but you need people to use it to be able to gather the data to make it better.

So in the short term, we are being pragmatic and using what is undeniably a very good STT engine at Google. But we are protecting privacy by breaking the connection between your Google identity and your speech-to-text commands/queries. We do this by being a proxy: all the Google system knows is that some Mycroft device is requesting an STT translation, but it doesn’t know which Mycroft device or (most importantly) which user. To me as an individual, that is pretty reasonable privacy.
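To make the proxy idea a bit more concrete, here is a minimal sketch of the pattern (not Mycroft’s actual backend code): the device talks only to the Mycroft backend, and the backend forwards the raw audio upstream using its own credentials, dropping anything that could identify the device or user. The endpoint URL, header names and response field below are invented for illustration.

```python
# Toy illustration of an anonymizing STT proxy -- NOT Mycroft's real backend.
# The cloud endpoint, key and response field are hypothetical.
import requests

CLOUD_STT_URL = "https://cloud-stt.example.com/v1/recognize"  # hypothetical endpoint
PROXY_API_KEY = "one-shared-backend-key"                      # not tied to any user

def proxy_stt(audio_wav: bytes, device_request_headers: dict) -> str:
    """Forward audio to the cloud STT provider without any user identity.

    Whatever identifying headers the device sent (account token, device ID,
    source IP) are deliberately dropped; the provider only ever sees the proxy.
    """
    # device_request_headers is intentionally ignored -- nothing user-specific
    # is forwarded upstream.
    resp = requests.post(
        CLOUD_STT_URL,
        headers={"Authorization": f"Bearer {PROXY_API_KEY}"},
        data=audio_wav,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["transcript"]
```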

Then we can begin capturing (with user permission, and anonymized) the voices and the results of the Google STT to gather the mounds of voice data needed to train OpenSTT and make it better. THAT is the turning point for OpenSTT which breaks the Catch-22.

So yes, we are working on OpenSTT although much of it is the indirect effort of creating the environment and mechanisms we need to collect the data so we can build it.

3 Likes

P.S. We also have agreements with Canonical to bring Mycroft into Ubuntu, which will bring in more users, which will bring in more data, which will massively speed up the whole data collection process.

1 Like

User permission aside, does whatever agreement you have with Google permit this kind of reverse engineering?

There is a little bit of a gray area in the terms of service, but from my reading of them we would not be in violation by doing what I described. Worst case, we can absolutely store the voice data we gather before it goes to them for STT (with the user’s permission, of course). We’d just have to use human effort to transcribe that voice data for later training purposes. There are a bunch of schemes we could use to simplify this, too. For example, we could have a Skill which occasionally asks people to repeat a phrase we give them. I’m honestly not too worried about that piece.
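For what it’s worth, here is a rough sketch of what a phrase-collection Skill like that could look like. It uses the MycroftSkill base class and get_response() as documented; the skill name, intent file, phrase list and logging are invented for illustration, and this is not an official or planned skill.

```python
# Rough sketch of a "repeat this phrase" Skill -- for illustration only.
import random
import time

from mycroft import MycroftSkill, intent_file_handler

# Invented prompt list; a real skill would localize these.
PHRASES = [
    "the quick brown fox jumps over the lazy dog",
    "set a timer for ten minutes",
    "what is the weather like tomorrow",
]

class PhraseCollectSkill(MycroftSkill):

    @intent_file_handler("collect.phrase.intent")
    def handle_collect_phrase(self, message):
        phrase = random.choice(PHRASES)
        # repeat.phrase.dialog would contain: "Please repeat after me: {{phrase}}"
        # get_response() speaks the prompt, listens, and returns the transcript.
        reply = self.get_response("repeat.phrase", data={"phrase": phrase})
        if reply:
            sample = {"prompt": phrase, "heard": reply, "ts": time.time()}
            self.log.info("Collected sample: %s", sample)
            self.speak("Thanks, got it.")
        else:
            self.speak("No worries, maybe next time.")

def create_skill():
    return PhraseCollectSkill()
```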

This gets me thinking about reCAPTCHA.

How about a new plugin that has people read text? STT can convert it and match it against the original. It will fail 80% of the time, but that already happens when I try to type in the text :). One nice aspect is that the text can be localized, so getting samples in many languages is possible. Also, the plugin can be distributed widely.
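One way such a plugin could decide whether a reading is usable is to compare the prompted text with the STT transcript and keep the sample only when they roughly match. A small sketch (the 0.8 threshold is just an arbitrary illustration):

```python
# Compare the prompted text against the STT transcript word by word and
# accept the recording if they are close enough. Threshold is illustrative.
from difflib import SequenceMatcher

def words(text: str) -> list:
    """Lowercase, strip punctuation and split into words for comparison."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def sample_is_usable(prompt: str, transcript: str, threshold: float = 0.8) -> bool:
    """Return True if the transcript roughly matches the prompted text."""
    ratio = SequenceMatcher(None, words(prompt), words(transcript)).ratio()
    return ratio >= threshold

# A slightly mangled reading still passes; an unrelated one does not.
print(sample_is_usable("set a timer for ten minutes",
                       "set the timer for ten minutes"))  # True
print(sample_is_usable("set a timer for ten minutes",
                       "play some jazz"))                  # False
```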

I mostly agree with your post, except for the portion about acceptable privacy. Google does know exactly whose Mycroft device the audio belongs to, because just like image identification, voice identification has been built into these large corporate systems. It doesn’t matter that everything goes through a proxy and they don’t know it’s a Mycroft device; they can identify and catalog a user’s home audio clips just like they do images and other metadata.

Here is a possible solution to improving accuracy of Kaldi

Ok, I see your point. And using all the Mycrofts out there as proxies makes me more confident about using the service.
I’ve just opted in to the Open Dataset, which I understand is the method you’re explaining here. :slight_smile:

Now I must figure out how to use Google STT instead of Mycroft’s own.

I’m not a Mycroft user or an ML guy, just a passer-by :slight_smile: FWIW, Mozilla has recently released a couple hundred hours of English speech data covering many accents, along with a model they claim has a 6.5% error rate on a LibriSpeech data set. You can see the announcement post here.

They’re still growing the # of hours and plan to add more languages soon.

1 Like

@jrbauer97, you are talking about DeepSpeech (https://github.com/mozilla/DeepSpeech) and the Common Voice project. They are tightly related but different things, and we (Mycroft) are already working with them to build this out.

DeepSpeech is the code that can do STT using machine learning approaches. But, like all machine learning systems, it is really useless without data to ‘learn’ from.

Common Voice is an attempt by Mozilla to build a body of recordings they can use to train on. It is a great idea, and we encourage you to go to https://voice.mozilla.org and help with this effort. It is building a fully open (CC0, aka public domain) dataset. But it is just a dataset; it doesn’t ‘do’ anything.

Finally, there is the model that Mozilla has generated by training DeepSpeech. That is what you were referring to with the 6.5% figure. Those models are based on data from Common Voice plus several other datasets (e.g. the LibriSpeech dataset, TED talks, etc.).

Mycroft fits into this in two ways:

  1. We are building another dataset for them to train on. Our users can participate in this via the Opt In mechanism. We are building a custom license that is different from any existing license, allowing you to back out of sharing your data. This is different from the Common Voice dataset because once you release something to the public domain, you don’t get to withdraw it.
  2. As DeepSpeech and the trained model we jointly build mature, we are going to integrate them into the Mycroft ecosystem in two ways:
    a) DeepSpeech will be the engine we use on our servers.
    b) We will provide tools allowing for a simple setup of DeepSpeech on your own device or private server, if you want to host your own STT service and have the computational power to support it (a rough sketch of what that looks like follows below).
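
To give a feel for 2b, this is roughly what calling DeepSpeech yourself looks like with its Python bindings (recent releases); the model and scorer file names are placeholders for whatever you download from the DeepSpeech releases page, and this is a sketch rather than the packaged tooling Mycroft will ship:

```python
# Minimal local DeepSpeech transcription sketch (recent DeepSpeech releases).
# Model/scorer file names are placeholders; audio must be 16 kHz, 16-bit mono.
import wave

import numpy as np
from deepspeech import Model

ds = Model("deepspeech-models.pbmm")          # acoustic model (placeholder name)
ds.enableExternalScorer("deepspeech.scorer")  # optional language model scorer

with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # prints the transcript
```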

The data DeepSpeech is training on right now is decent, but it is holding the STT performance back because it is too clean. Most of it was recorded by someone at a desk using a good mic in a quiet environment. For DeepSpeech to really shine it needs data of people talking in kitchens with a stove fan running, coffee shops with lots of chatter in the background, cars with road noise, etc. And it needs lots of that data. That is where we – Mycroft AI and our community – have the unique ability to collaborate and really make this a successful joint effort.
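For anyone curious, one common way to get that kind of noisy data out of clean recordings is to mix background noise (a fan, café chatter, road noise) into the speech at a chosen signal-to-noise ratio. A plain NumPy sketch; real augmentation pipelines do quite a bit more:

```python
# Mix background noise into clean speech at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return speech with noise added at roughly `snr_db` dB SNR.

    Both arrays are float samples at the same sample rate; the noise clip is
    tiled or trimmed to match the length of the speech.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. augmented = mix_at_snr(clean_clip, kitchen_fan_clip, snr_db=10)
```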

I hope this all makes sense!

2 Likes

@steve.penrod Thanks for the response! Glad to hear you are working together with Mozilla on this; when I heard about you guys after reading Mozilla’s initial post about Common Voice/DeepSpeech, I thought it sounded like a cool use case. Interesting to know that the Common Voice dataset is too clean for your needs (though that makes sense). I recorded a few samples, but maybe I’ll add some more and turn on a few sources of white noise to dirty them up a bit :wink:

1 Like

I’m an avid user of Voice Attack and Voicebot. Both work great, based on the Microsoft Speech Recognition API (which you need to train). No cloud involved at all.

That said, I’m also here to offer my 50 cents on the statement that you have to be in the 90%+ ballpark to even see usage. In my opinion you’re a little wrong about that - it will be a lost war as well as a gateway to always relying on a cloud-based STT from Amazon, Google and so on.
What you need is a way to handle inputs below a certain confidence threshold (it would be great if the user could set it).
Why not let Mycroft respond, “I didn’t understand you perfectly - did you mean ‘…’?” and use that info to train the system?
Personally, I’m totally fine with confidence levels of around 80%. It works absolutely great for me - with just 20 minutes of training it recognizes what I’m saying with a confidence of 80-93%, mostly. Keep in mind that I’m not even a native speaker; native speakers would see an even better experience, I’m sure.
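
Sketched out, the flow I have in mind looks something like this. The thresholds are user preferences rather than tuned values, and speak/confirm/run_command/record_feedback are stand-ins for whatever hooks the assistant actually provides:

```python
# Illustrative confidence-threshold flow, not Mycroft code.
CONFIDENCE_OK = 0.90      # act without asking
CONFIDENCE_FLOOR = 0.50   # below this, don't even guess

def handle_utterance(transcript, confidence, speak, confirm, run_command, record_feedback):
    if confidence >= CONFIDENCE_OK:
        run_command(transcript)
    elif confidence >= CONFIDENCE_FLOOR:
        # "I didn't understand you perfectly - did you mean ...?"
        if confirm(f"Did you mean: {transcript}?"):
            run_command(transcript)
            record_feedback(transcript, accepted=True)   # usable training signal
        else:
            record_feedback(transcript, accepted=False)
    else:
        speak("Sorry, I didn't catch that. Could you say it again?")
```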

Also, about that proxy thing for Google STT:
You’re talking about how it will break the connection. Is that guaranteed?
You mention it’s not user-based - what about device identifiers?
And how will you cover the usage costs? From what I’ve read, it costs (a lot): https://cloud.google.com/speech/
How can you be sure Google STT can’t identify users by usage patterns, voice and context?

I’m genuinely interested in buying a Mark II device (because of the array), but all this talk about using Google/Amazon or other cloud STT providers makes me question it in terms of protecting my privacy.