Always Listening While Always Parsing

The paranoid end user likely wouldn’t be using Mycroft in the first place, fearing a shadow government may be listening to anything spoken near an adjacent mic. Thing is, privacy is antiquated, a commodity that could be kept back in the days when light bulbs, mp3 players, and dog collars were not connected to the internet.

Ok, we’re now 2 years down the road. Where does this idea stand with the current progress Mycroft and Raspberry Pis have made?

Whatever progress you’ve made is going to be it.

There is Kaldi-spotter by @JarbasAl, which goes in the direction of “always listening”. You can define phrases that trigger intents without using the hotword.


Disclaimer: I haven’t read this whole thread. However, I can pretty confidently say this won’t be coming to Mycroft (at least in the next few years).

Using a local STT engine like Kaldi and defining a limited number of phrases (as Jarbas has) means you could achieve a portion of this. However a broad all listening and all understanding assistant is economically and technically not feasible (on current hardware) at this time.
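To make the "limited number of phrases" idea concrete, here is a minimal sketch of mapping a locally produced transcript to an intent. It is not Kaldi-spotter's actual API; the phrase table, the intent names, and the `match_intent` helper are all made up for illustration, and the transcript is assumed to come from whatever local STT engine you run.

```python
import re

# Hypothetical phrase table: each fixed phrase maps to a made-up intent name.
PHRASES = {
    "turn on the lights": "lights.on",
    "turn off the lights": "lights.off",
    "what time is it": "time.query",
}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def match_intent(transcript):
    """Return the intent whose phrase occurs as whole words, else None."""
    padded = f" {normalize(transcript)} "
    for phrase, intent in PHRASES.items():
        if f" {phrase} " in padded:
            return intent
    return None


if __name__ == "__main__":
    print(match_intent("Hey, turn on the lights!"))
    print(match_intent("Nice weather today"))
```

Anything that does not contain one of the fixed phrases is simply ignored, which is what keeps this cheap compared with understanding arbitrary speech.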

If you take a look at the cost per minute of STT transcription from even the largest players like Google, this roughly shows how expensive it would be for Mycroft to be performing this. Once more STT moves on device this should be easier, however it would still require significant system resources to be constantly transcribing everything it hears.
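As a rough back-of-the-envelope sketch of that argument: a device that transcribes continuously is billed for every minute of the day. The function below is only arithmetic; the rate is a parameter you would have to look up from a provider's current price list, and no real price is assumed here.

```python
def daily_stt_cost(rate_per_minute, hours_listening=24.0):
    """Cost of transcribing continuously for the given hours per day.

    rate_per_minute is a placeholder for a provider's per-minute price.
    """
    minutes = hours_listening * 60
    return rate_per_minute * minutes


if __name__ == "__main__":
    # With a rate of 1 currency unit per minute, a full day costs 1440 units,
    # because there are 1440 minutes in 24 hours.
    print(daily_stt_cost(1.0))
```

Whatever the actual rate is, it gets multiplied by 1440 per device per day, compared with the few seconds per request that hotword-triggered listening sends.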

Sorry if this is a bit of a hijack; I think it is a closely related idea, but not exactly the same. What about speaker diarization (who is speaking, not what is being said)? Is it feasible to feed an audio stream to another Pi-like device and stream back who is talking, in anything approaching real time? Or is the problem as big as continuous STT?

Should probably start another thread for that, there’s some work on speaker ID out there already.

You could create a “we have to talk” skill. If I am alone and have several inquiries (weather, music, stock market …), you could have a real conversation. The audio level could be the basis for sending a request to the server: wait until it is quiet, then send to STT. To end the conversation, say “thank you”, or stop after 30 s.
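A minimal sketch of the "wait until quiet, then send to STT" part, assuming audio arrives as fixed-length frames of integer samples. The threshold values, the frame length, and the `collect_utterance` name are assumptions for illustration, not anything Mycroft provides; the 30 s cap is the one mentioned above.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of integer samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def collect_utterance(frames, frame_ms=100, quiet_rms=500.0,
                      quiet_ms=1000, max_ms=30000):
    """Collect frames until speech is followed by quiet_ms of quiet,
    or until max_ms of audio has been collected."""
    collected = []
    quiet_frames = 0
    heard_speech = False
    for frame in frames:
        collected.append(frame)
        if rms(frame) >= quiet_rms:
            heard_speech = True
            quiet_frames = 0
        else:
            quiet_frames += 1
        # Stop once the speaker has gone quiet for long enough.
        if heard_speech and quiet_frames * frame_ms >= quiet_ms:
            break
        # Hard cap so a noisy room cannot keep the recording open forever.
        if len(collected) * frame_ms >= max_ms:
            break
    return collected


if __name__ == "__main__":
    loud, quiet = [1000] * 160, [0] * 160
    frames = [quiet] * 3 + [loud] * 5 + [quiet] * 20
    print(len(collect_utterance(frames)))
```

The returned frames would then be handed to the STT engine as one request, and the skill would loop back to listening until it hears “thank you” or hits the time limit.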