The Most Important Thing

IMHO, the most important thing Mycroft could possibly do is move to a local speech-to-text model and get us all off "the cloud" and the WAN every time we say something.

Hard, you say? My Garmin GPS, which I bought about ten years ago, understands me pretty well, and it isn't connected to anything beyond receiving signals from the GPS satellite constellation. I can name any town or street or address combination, and as long as I speak clearly it gets it right almost all of the time. So the tech is out there, and has been for a while. If Garmin (and Dragon, etc.) could do it (and they definitely did), so can others.

Speaking as a developer, my interest in MyCroft would skyrocket if it worked reasonably well in a LAN-only context. Speaking as a user, I’ve already got devices that look to the cloud for speech. Adding yet another, and thereby supporting the cloud model further, has very little appeal.

Local STT is the killer feature that could (finally) make Mycroft stand out even among the big shots. I'd love to see that, as the combination of an open development system and real privacy and security based on LAN-only operation seems unbeatable to me.


Garmin’s offline voice recognition works surprisingly well. I love my Garmin GPS, and every time I use it I’m amazed that it can recognize my voice commands and complex addresses, all while offline and in a noisy environment.

I would disagree about localizing TTS or STT, because it seems better to have a cloud server do the processing, with continuous improvements and updates. The latest cutting-edge TTS and STT uses neural networks and is always getting better.

My offline Garmin works great, but it is not as good at TTS or STT as Google Assistant or Amazon Polly.

I think this is a pretty popular opinion, and there has been talk about it before. I’d like to get STT working either locally or at least on a LAN server as well. I will probably pursue this on my own unless something public comes out. It looks like there is at least the beginnings of support for Kaldi and DeepSpeech, so I am going to set up a LAN server to run those first and see how that works.
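To make the LAN-server idea concrete, here is a minimal sketch of the round trip: a tiny HTTP STT service on the local network and a client that POSTs audio to it. The `/stt` path, the plain-text response, and the `transcribe()` stub are all assumptions for illustration; a real server would load a DeepSpeech or Kaldi model in place of the stub.

```python
# Minimal LAN-only STT round-trip sketch (stdlib only).
# NOTE: transcribe() is a stub standing in for a real DeepSpeech/Kaldi
# model call; the /stt endpoint is a hypothetical name.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def transcribe(audio_bytes: bytes) -> str:
    # Stand-in for a real model.stt(audio) call.
    return "turn on the kitchen lights"

class STTHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw audio body and return the transcript as text.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        text = transcribe(audio).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(text)))
        self.end_headers()
        self.wfile.write(text)

    def log_message(self, *args):
        pass  # keep the example quiet

# Bind to a free port on the loopback interface; on a real LAN server
# you would bind to the LAN address instead.
server = HTTPServer(("127.0.0.1", 0), STTHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/stt"
req = urllib.request.Request(url, data=b"fake-wav-bytes", method="POST")
with urllib.request.urlopen(req) as resp:
    result = resp.read().decode("utf-8")
print(result)  # the transcript the server returned
server.shutdown()
```

The point is that nothing here ever leaves the local network: the device captures audio, a LAN box does the recognition, and only text comes back.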

Maybe Mycroft can open-source their DeepSpeech-based server API? Or perhaps they have and I just haven’t seen it yet. Then we could run that system locally, and it would be the same system Mycroft currently uses, just LAN-based rather than WAN. It’d also be nice if they released periodic updates to the training model that we could download: the same models they are training and improving via all the Mycroft devices currently hitting their servers.


Private STT models and tools cost a lot of money to license.

Offline/local solutions exist if you want to try them, DeepSpeech being one of the leading choices at the moment. So far this week it has correctly heard one of the twenty phrases I’ve asked Mycroft. It’s getting better, at least. Also, it uses a machine with a dedicated GPU for that effort. PocketSphinx can run on less hardware and might be an easier one to implement. Kaldi is another good choice, but it also takes hardware to work well.

DeepSpeech is from Mozilla; you can check it out here:

Instructions for use, including downloading their models, can be found there as well.

I know it is; I meant the external server API that wraps DeepSpeech, the one the Mycroft device calls on the Mycroft servers. Bundle that with the DeepSpeech distribution and release it to the public, IMO, unless it already is public and I’m just blind.

There’s this: and Mozilla indicated a while ago that they were possibly working on something similar. I haven’t seen any follow-up to that.

I use the ds-server and it works rather well; I dumped tens of thousands of audio clips through it for transcription, and it worked well as long as they weren’t too long. DeepSpeech 0.2 (I haven’t gotten to test 0.3 or 0.4 yet) went much quicker than 0.1.

Will check it out, thanks

Hi there @fyngyrz, thanks for your feedback.

We agree: we’d really like to have an on-device, local STT option. However, at this time we’ve decided that, long term, DeepSpeech is what we will be aiming to use as our default STT. There has been some progress towards getting DeepSpeech to run on embedded-level hardware (such as the armhf architecture the Raspberry Pi runs on); you can see more at:

@baconator has provided a great overview of other options. One of the architectural decisions made with Mycroft early on was that it would be modular, allowing you to choose which Wake Word listener, STT layer, and TTS layer you prefer, so you’re very welcome to try out other STT options with Mycroft.
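As an illustration of that modularity, swapping the STT layer is a configuration change. The sketch below shows the general shape of an STT section in a Mycroft user configuration; the exact module name and option keys here (`deepspeech_server`, `uri`) are assumptions for illustration and should be checked against the current Mycroft documentation for your version.

```json
{
  "stt": {
    "module": "deepspeech_server",
    "deepspeech_server": {
      "uri": "http://192.168.1.50:8080/stt"
    }
  }
}
```

With a section like this in the user config, the device would send audio to a server on the local network instead of the default cloud STT backend.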

For now, what you can do to help DeepSpeech progress is assist us in improving its accuracy by contributing training data at: