Always Listening While Always Parsing

Contact in case you guys would like to talk: http://audeme.strikingly.com/#contact

Can recognize several hundred user-defined sentences

Seems that this may be severely limited in what it can recognize, though it’s potentially a better wake-word solution than PocketSphinx (which is a little resource-heavy). @daniel, do you know if this supports parameterization in the grammar, or is it just fixed string recognition?

Examples:
weather in #city#
score of #team name# game
price of #stock symbol#
call #contact#

To be clear, we’re most likely looking for 2 speech solutions. One that will definitely run on device and only needs to support a small, fixed grammar (the wake word), and a second that supports arbitrary dictation. The latter (at this time) is going to be running off-device, as we don’t have a feasible on-device solution that could support a large variety of functionality.

I’m not sure what you mean.

In speech recognition, there are typically two styles: grammar-based (a formal grammar defining everything that can be recognized) or dictation, which recognizes arbitrary spoken language. The description “recognizes hundreds of user defined sentences” suggests something even more limited than a grammar-based recognizer, which would allow for a massive number of utterance combinations through parameter substitution.

From my example above, imagine the differences between the following grammars (as complete specifications of what can be recognized); there’s a small expansion sketch after the lists below.
In a parameterized grammar:
Contact

  • Mom
  • Dad
  • John
  • #insert 700 other contacts#

Location

  • home
  • work
  • cell

call #Contact# on||at #Location#

vs
call Mom at home
call Mom at work
call Mom at cell
call Dad at home
call Dad at work
call Dad at cell
call John at home
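
For concreteness, here’s a minimal Python sketch (the lists and names are hypothetical, and this isn’t Adapt’s actual API) of how one parameterized rule expands into the flat sentence list a fixed-string recognizer would need:

```python
from itertools import product

# Hypothetical parameter lists; the example above imagines ~700 contacts.
contacts = ["Mom", "Dad", "John"]       # ...plus 700 others in practice
locations = ["home", "work", "cell"]

# One parameterized rule -- call #Contact# at #Location# -- covers what a
# fixed-string recognizer would need spelled out sentence by sentence.
sentences = ["call %s at %s" % pair for pair in product(contacts, locations)]

print(len(sentences))   # 3 x 3 = 9 here; 700 contacts x 3 locations = 2100
for s in sentences:
    print(s)
```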

Does this help clarify a bit?

In wildlife photography, for capturing videos of rare and unpredictable events like whales leaping out of the water, they don’t want to record 24/7 waiting for something to happen, as it would use up all of their storage and 99.9% of it would be thrown away.

Instead, they have a camera that is always recording but only stores the last minute or two of footage. When the cameraperson sees a whale do something noteworthy, he or she presses a button which tells the recording equipment not to delete and to keep recording until it’s told to stop. This way, only interesting footage of the event and the preceding minute building up to it is recorded.

Apologies for the convoluted example, but what if Mycroft did something similar? I.e. always record, but only store the last x minutes on a rolling basis unless the keyword is said. This would give it some context, allowing what you wanted but without constantly uploading data to cloud services.


Yes, the implementation will likely look something like this. Currently, we’re attempting a technique that starts by detecting a base level of background noise, then listens for speech (noise above a threshold followed by a drop back down to background noise). That data will be locally transcribed to determine if a key/wake phrase is present, and then full STT will be executed if it passes that minimum threshold.
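
For illustration only (this isn’t the actual Mycroft code), that technique might look roughly like the sketch below, where read_chunk() is an assumed function returning one frame of 16-bit little-endian mono audio:

```python
import math
import struct

def rms(chunk):
    # Root-mean-square energy of a 16-bit little-endian mono frame.
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def listen_for_utterance(read_chunk, calibration_frames=50, margin=1.5):
    # 1. Estimate the background noise floor from an initial sample.
    baseline = sum(rms(read_chunk()) for _ in range(calibration_frames))
    threshold = (baseline / calibration_frames) * margin

    # 2. Wait for energy to rise above the threshold (speech onset).
    while True:
        chunk = read_chunk()
        if rms(chunk) >= threshold:
            break
    frames = [chunk]

    # 3. Keep recording until energy drops back to the background level.
    quiet = 0
    while quiet < 10:   # ~10 consecutive quiet frames ends the utterance
        chunk = read_chunk()
        frames.append(chunk)
        quiet = quiet + 1 if rms(chunk) < threshold else 0

    # The result would go to local transcription for the wake-phrase check,
    # and only on to full STT if it passes that minimum threshold.
    return b"".join(frames)
```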


@seanfitz Okay, I’m not sure if the Adapt intent parser is in use in the early builds, but if you notice in the campaign video ( https://www.youtube.com/watch?v=g1G0yEKuED8 ), there’s a gap between saying Mycroft and making the request, which feels unnatural. For example:

"Mycroft"
Pause for a second.
Mycroft beeps.
"Play some terrible 80’s music"
Music plays.

However, what would feel more natural would be the following:

"Mycroft, play some terrible 80’s music"
Mycroft beeps.
Pause for a second.
Music plays.

I’m just thinking/typing out loud here, but I think this could be done if the processing were done separately from the recording.

So it is always recording for, say, the last ten seconds in a constant loop.

However, it would only process it if an increase in volume is detected. Meanwhile, a separate process is constantly scanning the previous second for a change in volume over a given threshold. If a change in volume is detected, the parsing flag is changed from False to True, and the previous ten seconds plus whatever follows is kept until the volume decreases again, as you described.

If the word “Mycroft” is not detected, it is discarded.

If it is detected, however, the whole lot is sent off for cloud parsing.

There is also the potential to continue recording for another 20 seconds or so afterwards to check for follow-up requests, maybe?
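
Put together, a minimal sketch of that loop might look like this; read_chunk(), is_loud(), transcribe(), and send_for_parsing() are all hypothetical stand-ins for the audio source, volume-threshold check, local STT, and cloud parser:

```python
from collections import deque

def always_listening(read_chunk, is_loud, transcribe, send_for_parsing,
                     chunks_per_second=10):
    ring = deque(maxlen=10 * chunks_per_second)  # rolling ~10 s of audio
    recording = False   # the "parsing flag" described above
    kept = []

    while True:
        chunk = read_chunk()
        ring.append(chunk)              # old audio falls off the back
        if not recording and is_loud(chunk):
            recording = True
            kept = list(ring)           # keep the preceding ~10 s as context
        elif recording:
            kept.append(chunk)
            if not is_loud(chunk):      # volume dropped: utterance over
                recording = False
                audio = b"".join(kept)
                kept = []
                # "Mycroft" can appear anywhere in the transcription, not
                # just at the start; the audio is discarded if it's absent.
                if "mycroft" in transcribe(audio).lower():
                    send_for_parsing(audio)
```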

What do you think? This may be exactly what you already do, but I’m mentioning it in case it’s not. Strikes me as a good way to improve reaction times. Also, you could insert the word Mycroft anywhere in the sentence, rather than always at the beginning, e.g.

“Play some terrible 80’s music please, Mycroft”

Sounds like a good idea. And yeah, Adapt wasn’t in the video.


@daniel Adapt wasn’t in the video, as I had not yet joined the project or offered up the code for inclusion.

@Autonomouse your English text is a significantly more verbose (but nonetheless accurate) reading of my current Python code :smile:


In which case: “yey”

Glad to know I wasn’t completely off the mark, then. I look forward to looking through the code to see how you did it.

Cheers


A potential way to deal with the privacy issues would be to have a command to make Mycroft stop listening.
User: "Mycroft, cover your ears."
Mycroft: "You may speak freely."
I guess that doesn’t really secure the content of the audio data being sent.
Would it be possible to split the audio into multiple pieces that can be sent to different servers (perhaps in scrambled order), or does the STT engine require context to make an accurate prediction? If this is possible, might there be a performance boost from analyzing each word separately (asynchronously)?
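
For what it’s worth, the splitting idea might look like the sketch below (send_to() and the piece size are hypothetical); the caveat is that STT language models use context across word boundaries, so isolated pieces would likely transcribe less accurately:

```python
import random

def scatter(audio, servers, send_to, piece_bytes=3200):
    # piece_bytes=3200 is ~100 ms of 16 kHz 16-bit mono audio.
    pieces = [(i, audio[off:off + piece_bytes])
              for i, off in enumerate(range(0, len(audio), piece_bytes))]
    random.shuffle(pieces)  # scrambled order across servers
    for index, piece in pieces:
        # The index lets the trusted client reassemble the per-piece
        # transcriptions later; no single server sees the whole utterance.
        send_to(random.choice(servers), index, piece)
```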


To play devil’s advocate, the paranoid end user won’t be convinced by this, as you’re still taking it on trust that Mycroft isn’t listening. Maybe a physical hardware off switch would inspire more confidence? (Or you could just unplug the Ethernet :slight_smile: )

The paranoid end user likely wouldn’t be using Mycroft in the first place, fearing a shadow government may be listening to anything spoken near an adjacent mic. Thing is, privacy is antiquated: a commodity that could be kept back in the days when light bulbs, mp3 players, and dog collars were not connected to the internet.

Ok, we’re now 2 years down the road. Where does this idea stand with the current progress Mycroft and Raspberry Pis have made?

Whatever progress you’ve made is going to be it.

There is Kaldi-spotter by @JarbasAl, which goes in the direction of “always listening”. You can define phrases which trigger intents without using the hotword.


Disclaimer: I haven’t read this whole thread; however, I can pretty confidently say this won’t be coming to Mycroft (at least in the next few years).

Using a local STT engine like Kaldi and defining a limited number of phrases (as Jarbas has) means you could achieve a portion of this. However, a broad, always-listening and all-understanding assistant is economically and technically infeasible on current hardware at this time.

If you take a look at the cost per minute of STT transcription from even the largest players like Google, it gives a rough sense of how expensive this would be for Mycroft to perform (see the back-of-envelope below). Once more STT moves on-device this should get easier, but it would still require significant system resources to be constantly transcribing everything it hears.
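
As a rough back-of-envelope, using an illustrative (not quoted) per-minute rate:

```python
# Illustrative figures only -- not a quoted price from any provider.
price_per_minute = 0.024            # assume a few cents per minute of STT
minutes_per_month = 60 * 24 * 30    # one device transcribing around the clock

print("$%.0f per device per month" % (price_per_minute * minutes_per_month))
# -> roughly $1037 per device per month, before any volume discounts
```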

Sorry if this is a bit of a hijack; I think it is a closely related idea, but not exactly the same. What about speaker diarization (who is speaking, not what is being said)? Is it feasible to feed an audio stream to another Pi-like device and stream back who is talking, in anything approaching real time? Or is the problem as big as continuous STT?

Should probably start another thread for that, there’s some work on speaker ID out there already.

You could create a “we have to talk” skill. If I am alone and have several inquiries (weather, music, stock market…), I could have a real conversation. As the basis for sending the request to the server, one could use the audio level: wait until it’s quiet, then send to STT. To conclude, say “thank you” to stop the conversation, or it stops after 30 seconds.