Always Listening While Always Parsing

If Adapt is added to Mycroft locally (which would be awesome even if it’s just a setting), could we have Mycroft try to parse all speech? That way Mycroft could chime into conversations and be conversational without explicit instruction. I can see it being annoying, but if you only have a few commands enabled, or make special ones for certain projects, it would be really helpful.

It would be so cool if someone walked into your house, started talking, and mentioned the extremely hot weather, and Mycroft then gave some information about the weather and how it’s a record high in San Diego.

This should not be the default unless it is fairly unobtrusive, and it should only act on commands you pick. For instance, I wouldn’t want it to launch Linux commands or start playing video. Having Mycroft say what it’s about to do, like “I took it that you’re warm; turning on the A/C”, or asking it to confirm intents, may be helpful.

When I’d like Mycroft to initiate or join a conversation (this may be different for everyone):

Simple Knowledge:

  • a math question
  • spelling of a word or phrase
  • translation
  • asking about something like “How long is the Great Wall of China?”
  • saying something like “I thought that pizza was pie in Italian” (this would be so fun; Mycroft could point out how wrong they are)

Responses to Reactions:

  • complaining about the temperature should change the thermostat
  • saying “I forgot to lock the front door” should lock it

I’ll probably add more as I remember other cool things I could add to Mycroft.

Sorry if something doesn’t make sense; Fleksy freaks out in web text input fields.

Well, parsing all speech seems very intrusive and would take a lot of computing power and drive space.

I prefer the keyword-trigger approach.

Yeah I figured it would be intensive, but I think it’d be worth it if the Pi can handle it.

Hey Daniel! Right now, Adapt is only targeting text-to-intent, and Mycroft will rely on a 3rd party service for speech-to-text. There will be a local speech recognizer, but it will be scoped almost exclusively to the wake word. A general purpose speech to text service would (at this time) be too intensive to run on-device, and streaming audio to a 3rd party 24/7 is likely impractical and raises serious privacy concerns.

I agree this opens up a lot of interesting use-cases, but I don’t consider it a reasonable goal for v1.
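
For anyone curious what the text-to-intent side looks like, here is a minimal sketch using the adapt-parser API as shown in its public examples; the keywords, entities, and intent name are made up for illustration, and this is not the exact Mycroft integration:

```python
from adapt.engine import IntentDeterminationEngine
from adapt.intent import IntentBuilder

engine = IntentDeterminationEngine()

# Register the vocabulary the parser should recognize (illustrative values).
for keyword in ["weather", "forecast"]:
    engine.register_entity(keyword, "WeatherKeyword")
for city in ["san diego", "seattle"]:
    engine.register_entity(city, "Location")

# An intent that requires a weather keyword and optionally a location.
weather_intent = IntentBuilder("WeatherIntent") \
    .require("WeatherKeyword") \
    .optionally("Location") \
    .build()
engine.register_intent_parser(weather_intent)

# Note the input is already text; Adapt does not do speech-to-text itself.
for intent in engine.determine_intent("what is the weather in san diego"):
    if intent and intent.get("confidence", 0) > 0:
        print(intent)
```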

If something like this were added in the future, I think it would be better for detected keywords to trigger an indicator light. Then, when asked, “Mycroft, what?”, Mycroft would give a suggestion based on the keywords heard (e.g. “I heard you were cold. Shall I increase the thermostat?”). And I would expect more success with hardware optimized for audio transcription, as discussed in this article.

Ah, I see. That makes a lot of sense. I can’t wait until it’s tackled! :smiley:

I would say the idea is really neat! But we’ll have to see, as a community, if there is any way to implement this that ensures users’ privacy and a solid experience.

Have you looked at MOVI? It is custom-built speech recognition hardware. https://www.kickstarter.com/projects/310865303/movi-a-standalone-speech-recognizer-shield-for-ard/description

It seems like a very natural fit for these two projects to collaborate.

Contact in case you guys would like to talk: http://audeme.strikingly.com/#contact

Can recognize several hundreds of user defined sentences

Seems that this may be severely limited in what it can recognize, though it’s potentially a better solution for the wake word than pocketsphinx (which is a little resource-heavy). @daniel, do you know if this supports parameterization in the grammar, or is it just fixed-string recognition?

Examples:
weather in #city#
score of #team name# game
price of #stock symbol#
call #contact#

To be clear, we’re most likely looking for two speech solutions: one that will definitely run on-device and only needs to support a small, fixed grammar (the wake word), and a second that supports arbitrary dictation. The latter (at this time) is going to run off-device, as we don’t have a feasible on-device solution that could support a large variety of functionality.

I’m not sure what you mean.

In speech recognition, there are typically two styles: grammar-based (a formal grammar defining everything that can be recognized) or dictation, which recognizes any spoken language. The description “recognizes hundreds of user-defined sentences” suggests something even more limited than a grammar-based recognizer, which allows a massive combination of utterances through parameter substitution.

From my example above, imagine the differences between the following grammars (as complete specifications of what can be recognized).
In a parameterized grammar:
Contact

  • Mom
  • Dad
  • John
  • #insert 700 other contacts#

Location

  • home
  • work
  • cell

call #Contact# on||at #Location#

vs
call Mom at home
call Mom at work
call Mom at cell
call Dad at home
call Dad at work
call Dad at cell
call John at home
call John at work
call John at cell
Does this help clarify a bit?
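
To put the same point in code, here is a rough Python illustration (the contact and location lists are made up) of how a parameterized grammar is just a compact description of a potentially huge set of fixed sentences:

```python
from itertools import product

# Hypothetical slot values; a real contact list could have hundreds of entries.
contacts = ["Mom", "Dad", "John"]
locations = ["home", "work", "cell"]

# The parameterized grammar "call #Contact# at #Location#" compactly describes
# every fixed sentence below (the "on" alternative is omitted for brevity).
expanded = [f"call {c} at {l}" for c, l in product(contacts, locations)]

print(len(expanded))  # 3 contacts * 3 locations = 9 sentences
for sentence in expanded:
    print(sentence)

# With 700 contacts and both prepositions, that is 700 * 2 * 3 = 4200 fixed
# sentences, which is why a pure sentence-list recognizer scales poorly.
```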

In wildlife photography, when capturing those videos of whales leaping out of the water and other things that happen rarely and unpredictably, they don’t want to record 24/7 waiting for something to happen, as it would use up all of their storage and 99.9% of it would be thrown away.

Instead, they have a camera that is always recording but only stores the last minute or two of footage. When the cameraperson sees a whale do something noteworthy, he or she presses a button which tells the recording equipment not to delete and to keep recording until it’s told to stop. This way, only interesting footage of the event and the preceding minute building up to it is recorded.

Apologies for the convoluted example, but what if Mycroft did something similar? I.e. always record, but only store the last x minutes on a rolling basis unless the keyword is said. This would give it some context, allowing what you wanted but without constantly uploading data to cloud services.
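
A rough sketch of that rolling-buffer idea in Python; the frame rate, buffer length, and the wake-word check are all placeholders, not how Mycroft actually does it:

```python
import collections

FRAMES_PER_SECOND = 10   # assume audio arrives in 100 ms chunks
BUFFER_SECONDS = 120     # keep roughly the last two minutes

# Ring buffer: the oldest frames fall off automatically as new ones arrive.
ring = collections.deque(maxlen=FRAMES_PER_SECOND * BUFFER_SECONDS)
kept = []                # frames we have decided to retain for processing

def on_audio_frame(frame, wake_word_heard):
    """Handle one captured audio frame.

    `wake_word_heard` stands in for whatever local detector decides the
    keyword was spoken; until then, audio silently rolls off the buffer.
    """
    if wake_word_heard:
        # Promote the buffered context plus the current frame to storage.
        kept.extend(ring)
        kept.append(frame)
        ring.clear()
    else:
        ring.append(frame)
```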

Yes, the implementation will likely look something like this. Currently, we’re attempting a technique that starts by detecting a base level of background noise, then listens for speech (noise above a threshold followed by a drop back down to background noise). That data will be locally transcribed to determine if a key/wake phrase is present, and then full STT will be executed if it passes that minimum threshold.
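
As a hedged sketch of what that pipeline might look like (the threshold, `transcribe_locally`, and `send_to_stt` are placeholders, not the actual Mycroft code):

```python
import audioop  # stdlib helper for RMS energy of raw 16-bit PCM audio

WAKE_PHRASE = "mycroft"

def listen(frames, background_rms, margin=1.5):
    """Collect one utterance: energy rises above background, then falls back."""
    utterance, speaking = [], False
    for frame in frames:
        loud = audioop.rms(frame, 2) > background_rms * margin
        if loud:
            speaking = True
            utterance.append(frame)
        elif speaking:
            break  # dropped back down to background noise
    return b"".join(utterance)

def handle_utterance(audio, transcribe_locally, send_to_stt):
    # Cheap local transcription, only to look for the wake phrase...
    if WAKE_PHRASE in transcribe_locally(audio).lower():
        # ...and only then pay for full speech-to-text.
        return send_to_stt(audio)
    return None  # below the threshold: discard the audio
```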

@seanfitz Okay, I’m not sure if the Adapt Intent Parser is in use in the early builds, but if you notice in the campaign video ( https://www.youtube.com/watch?v=g1G0yEKuED8 ), there’s a gap between saying Mycroft and making the request, which feels unnatural. For example:

"Mycroft"
Pause for a second.
Mycroft beeps.
"Play some terrible 80’s music"
Music plays.

However, what would feel more natural would be the following:

"Mycroft, play some terrible 80’s music"
Mycroft beeps.
Pause for a second.
Music plays.

I’m just thinking/typing out loud here, but I think that this could be done if the processing was done separately from the recording.

So it is always recording, keeping, say, the last ten seconds in a constant loop.

However, it would only process it if an increase in volume is detected. So meanwhile, a separate process is constantly scanning the previous second for a change in volume over a given threshold. If a change in volume is detected, the parsing flag is changed from False to True, and the previous ten seconds plus whatever follows is kept until the volume decreases again, as you described.

If the word “Mycroft” is not detected, it is discarded.

If it is detected, however, the whole lot is sent off for cloud parsing.

There is also the potential to continue recording for another 20 seconds or so afterwards to check for follow-up requests, maybe?

What do you think? This may be exactly what you already do, but just mentioning it in case it’s not. Strikes me as a good way to improve reaction times. Also, you could insert the word Mycroft anywhere in the sentence, rather than always at the beginning, e.g.

“Play some terrible 80’s music please, Mycroft”
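
On that last point, since the whole buffered utterance gets transcribed, checking for the wake word anywhere in it (rather than only at the start) is just a word-membership test. A tiny sketch, where `transcription` is whatever the local recognizer returned:

```python
import string

def contains_wake_word(transcription, wake_word="mycroft"):
    """True if the wake word appears anywhere in the transcribed utterance."""
    words = [w.strip(string.punctuation) for w in transcription.lower().split()]
    return wake_word in words

# Both of these would trigger further processing:
print(contains_wake_word("Mycroft, play some terrible 80's music"))         # True
print(contains_wake_word("Play some terrible 80's music please, Mycroft"))  # True
```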

Sounds like a good idea. And yeah, Adapt wasn’t in the video.

@daniel Adapt wasn’t in the video, as I had not yet joined the project or offered up the code for inclusion.

@Autonomouse your English text is a significantly more verbose (but nonetheless accurate) reading of my current Python code :smile:

In which case: “yey”

Glad to know I wasn’t completely off the mark, then. I look forward to looking through the code to see how you did it.

Cheers

A potential way to deal with the privacy issues would be to have a command to make Mycroft stop listening.
User: "Mycroft, cover your ears."
Mycroft: "You may speak freely."
I guess that doesn’t really secure the content of the audio data being sent.
Would it be possible to split up the audio into multiple pieces that can be sent to different servers (perhaps in scrambled order), or does the STT engine require context to make an accurate prediction? If this is possible, might there be a performance boost from analyzing each word separately (asynchronously)?

To play devil’s advocate, the paranoid end user won’t be convinced by this, as you’re still taking it on trust that Mycroft isn’t listening. Maybe a physical hardware off switch would inspire more confidence? (Or you could just unplug the Ethernet :slight_smile: )