Wake, Stream, Sleep?


#1

I have only worked with Alexa skills and chatbots (text typing and button pushing) before, so please excuse this basic question. I searched the docs and the source code but could not find what I was looking for, so I am probably missing a keyword and/or some architectural understanding.

I want to have two WAKE/HOT words, one is “Order” and the other is “Dictate”.

When I say “Order”, I will use one set of skills that understands the variety of items that are being ordered, and takes the appropriate actions.

When I say “Dictate”, I will use a different skill. The job of Dictate is to continuously perform speech-to-text until I say “End Dictate” or “Stop Dictate”.

I could not find how the system knows when to stop streaming the words I am saying, until the next WAKE/HOT word. Does it time out after a set amount of time with no words? Can I set that time dynamically? Does the system constantly stream to the back end without sleeping? Does a skill’s completion signal the end of the interaction?

I don’t necessarily want the system to stop streaming when I am in “Dictate” mode, but I do want it to stop streaming when I say “End Dictate” or when an “Order” skill signals that it is complete. So, how is it signaled to sleep and stop streaming to the ASR server?

Please direct me to the documentation and source code so I can review. I am already aware of these two wonderful Jarbas packages that can provide some assistance, but I am missing a key piece of the architecture puzzle to put it all together: https://github.com/JarbasAl/local_listener and https://github.com/JarbasAl/skill-dictation

Thanks so much. I received several intelligent responses to my queries the other day. Appreciate it.


#2

The listener calls the STT engine in mycroft/client/speech/listener.py.

There’s no streaming function for that; it stores the full phrase and then forwards it.
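To make the store-and-forward idea concrete, here is a tiny self-contained sketch of that flow. The function names and the fake STT engine are illustrative, not Mycroft’s actual API:

```python
# Illustrative store-and-forward flow (names are made up, not Mycroft's
# actual API): record the whole phrase first, then hand the complete
# audio buffer to the STT engine in one call, rather than streaming.

def record_phrase(mic_chunks):
    """Accumulate audio chunks until the phrase is complete."""
    buffered = bytearray()
    for chunk in mic_chunks:
        buffered.extend(chunk)          # store...
    return bytes(buffered)

def transcribe(audio, stt_engine):
    """...then forward the finished recording in a single request."""
    return stt_engine(audio)

# Demo with a fake STT engine that just reports the payload size.
audio = record_phrase([b"abc", b"def"])
text = transcribe(audio, lambda a: "got %d bytes" % len(a))
print(text)  # got 6 bytes
```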


#3

Thank you, baconator.

I had looked at listener.py before, but wasn’t sure exactly what I was looking for. I kept digging deeper into the code and found my answer.

Going to document here, just in case someone else has a similar question in the future, then ask another follow-up question at the end.

FINDING:
Determination of a phrase-start and phrase-end occurs in this file, https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/client/speech/mic.py, in this function, def _record_phrase(self, source, sec_per_buffer). Reading the code briefly, it looks like it determines that there was sufficient sound to indicate a phrase was most likely spoken, and then sufficient quiet at the end to determine that the phrase is done. The configuration parameters for this start right after the class definition for class ResponsiveRecognizer, and are currently hard coded constants (easy to change if required). The chunk is turned into audio, saved, and then it will be fed to the speech-to-text recognizer. Note that currently it looks like a 3 sec of silence, and a maximum of 10 secs per chunk of speech.

FOLLOW-UP QUESTION:
This will probably work fine for my “ORDERS”, but not so well for “DICTATION”, since most people pause longer when dictating, and they are unlikely, even with a short wake word such as DICTATE, to remember to say it at the start of every phrase.

Do any experts have an intelligent way to do this with Mycroft? Grab a minute or two of spoken dictation, with a few seconds of pause here and there, and then hand it off to the STT, without requiring a WAKE WORD before every phrase and after every pause of more than 3 seconds? I understand that, in this scenario, the user will have to utter an “END DICTATION” HOTWORD. I am not sure whether I can do that with a WAKE/HOTWORD, as Mycroft and its code base are still very new to me, and I am unsure about the parallelization.

Thanks for any additional help.


#4

That is exactly what I understand Jarbas is doing in the dictation skill:


    def converse(self, utterances, lang="en-us"):
        if self.dictating:
            # keep intents working without dictation keyword being needed
            self.set_context("DictationKeyword", "dictation")
            if self.check_for_intent(utterances[0]):
                # an intent (e.g. "end dictation") matched; let it handle this
                return False
            else:
                # consume the utterance as dictated text and keep listening
                self.speak("", expect_response=True)
                LOG.info("Dictating: " + utterances[0])
                self.dictation_stack.append(utterances[0])
                return True
        else:
            self.remove_context("DictationKeyword")
            return False
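For anyone new to the converse mechanism: before normal intent parsing, Mycroft offers each recently active skill a chance to consume the utterance via converse(); returning True swallows it. Here is a self-contained toy simulation of that loop. Everything in it is illustrative, not Mycroft internals:

```python
# Toy simulation of the converse() mechanism: active skills get first
# refusal on each utterance; returning True consumes it before normal
# intent parsing runs. All names here are illustrative.

class DictationSkill:
    def __init__(self):
        self.dictating = False
        self.dictation_stack = []

    def converse(self, utterances):
        if self.dictating:
            if utterances[0] == "end dictation":   # stand-in intent check
                self.dictating = False
                return False                       # let the intent fire
            self.dictation_stack.append(utterances[0])
            return True                            # swallow as dictation
        return False

def route(utterance, active_skills):
    """Mimic the intent service: try converse() first, then intents."""
    for skill in active_skills:
        if skill.converse([utterance]):
            return "consumed by converse"
    return "normal intent parsing"

skill = DictationSkill()
skill.dictating = True
print(route("dear diary it rained today", [skill]))  # consumed by converse
print(route("end dictation", [skill]))               # normal intent parsing
print(skill.dictation_stack)                         # ['dear diary it rained today']
```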

#5

Can I also add a gentle request: please be sensible with dictation in skills.

Mycroft isn’t intended to be a general speech-to-text or text-to-speech service. If you are creating something that you think has broader value to the community and would need to use our STT or TTS services more than a usual skill, please get in touch so that we can talk through the options.

I just wanted to make sure this was flagged for anyone reading along, as these are computationally expensive processes, so we need to be mindful of the impact this can have.

It’s great to see you’re looking at options for processing that data locally. Alternatively, you could use a third-party provider. As an example, Google STT will handle up to 1 minute synchronously, and up to 480 minutes asynchronously if you use their cloud storage as the data source. You get up to 60 minutes free per month, and it costs $0.024 for every minute after that.
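As a quick sanity check on those numbers (pricing as quoted above; verify current rates with Google), the monthly cost works out like this:

```python
# Worked example of the Google STT pricing quoted above:
# first 60 minutes free, then $0.024 per minute.

FREE_MINUTES = 60
RATE_PER_MINUTE = 0.024  # USD

def monthly_cost(minutes_used):
    billable = max(0, minutes_used - FREE_MINUTES)
    return round(billable * RATE_PER_MINUTE, 2)

print(monthly_cost(50))   # 0.0   (still inside the free tier)
print(monthly_cost(100))  # 0.96  (40 billable minutes)
```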


#6

Perhaps this will work right out of the box. I just wasn’t sure what would happen with long pauses (turns out anything longer than 3 seconds), or how I could end the dictation on purpose without sending it to the current back-end process. I didn’t want to send a 2-minute dictation to a back-end server to try to match intents, when all I really want is STT.

So, the issue becomes: how do I control the silences, end when the user requests it, run STT, and forward the result to a different back-end server for inclusion in the database?
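One possible shape for that last forwarding step, once the dictation text is in hand. The payload fields are hypothetical, and the endpoint URL is a placeholder for your own back end:

```python
# Hypothetical forwarding step: join the captured dictation lines and
# POST the result to your own back-end server. The payload fields and
# endpoint URL are made up for illustration.
import json

def build_payload(dictation_lines, session_id):
    """Assemble the finished dictation into a JSON-ready payload."""
    return {
        "session": session_id,
        "text": " ".join(dictation_lines),
    }

payload = build_payload(["dear diary", "it rained today"], "abc123")
print(json.dumps(payload))
# {"session": "abc123", "text": "dear diary it rained today"}

# Then, e.g. with the requests library (not run here):
# requests.post("https://example.invalid/dictation", json=payload)
```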

Thanks!


#7

Hi Gez,

Yes - what you say is correct. I didn’t include all of the details, but in my current diagram of the process, the “ORDER” path follows a pretty standard Mycroft flow, and the “DICTATE” path really just goes to an STT, such as Google’s. I did not want to stream the dictation, or store and forward it, to your current back-end or one of our own, since it does not need to run through the intent-matching process, just the STT.

That’s the reason I was looking at how the phrases are chunked, multiple WAKE/HOT words, etc., so I can choose the processing path based on the WAKE/HOT word.
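For reference, Mycroft’s user configuration does let you define additional hot words alongside the main wake word, roughly like the fragment below. The “order” entry here is my own sketch for the ORDER case, and the exact fields should be checked against the mycroft.conf documentation:

```json
{
  "hotwords": {
    "hey mycroft": {
      "module": "precise",
      "listen": true
    },
    "order": {
      "module": "pocketsphinx",
      "phonemes": "AO R D ER .",
      "threshold": 1e-90,
      "listen": true
    }
  }
}
```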

Thanks!