Mycroft Skills etiquette: How should they interact?

I am not sure if it is a detail. The most intuitive way, in my view, is to say “stop” when you want your assistant to stop what it is currently doing, in a narrow sense. If it talks, it should stop talking. If it displays some specific content, it should stop displaying that specific content. If it plays music, it should stop playing music, etc.

The semantic ambiguity of “stop” may very well require that Mycroft understand much more context, and ultimately also personal context (which can become personal data if Mycroft collects enough to fingerprint users). So I think that skill etiquette depends on, or at least interacts with, data privacy.

It probably also depends on accountability, e.g. when an assistant does something for you that you rely on. For instance, you may want Mycroft to order 10 pizzas for your birthday party. While Mycroft is announcing that it successfully sent the order, you quickly realize you want something else and tell Mycroft to stop. Does that mean it should stop talking or cancel the order? Should there be a different “default” for skills that perform something that may cost you money?

In the absence of context, a layered cake of “skill hierarchy” could be applied. It’s generally obvious that when music is playing and the alarm goes off, a single “stop” command is directed at the alarm.

At least, I do not see any case where the opposite would be expected.

I do realise that implementing a “skill hierarchy” in general is no simple thing. Maybe it could be a personal setting in each skill?

I think a lot of people lose track of what day of the month it is. Very few lose track of what year it is. I think it is generally a good policy to minimise clutter in Mycroft responses. If I want to quickly check what the date is, then the ideal response is “today is the 28th of September”. I most likely know what day of the week it is and what year of the century it is. I can ask specifically for that info on the rare occasions I need it. Long-winded responses from a voice assistant quickly grow old. The music request should definitely not be preceded by a long-winded announcement unless, as pointed out, it is taking an unusually long time.
I like the nesting of active contexts in the demo. If an alarm going off is the active context, then “stop” applies to that and not to background contexts. There could be some catch-all, as suggested, such as “mycroft quiet” if I need immediate silence to talk to someone in the room or answer my phone. (It might be a nice skill for Mycroft to recognise a phone ringing.)

When there’s a display then anything ambiguous or generic (like “stop”) should go to the displaying application.

If the displaying skill doesn’t handle it but others do then “do you want to stop timer1, egg timer or music” seems reasonable rather than stopping something you didn’t want stopped.

When the timer goes off it would be nice if Mycroft entered listening mode whilst it is sounding as either you’re not there to hear so it won’t matter or you’re likely to want to speak and acknowledge the timer.

Some other comments that I would vote for:

  • Not speaking the year
  • Music response is too long (though 10s is an eternity)
  • Visual feedback of listening

Nitpick: The timer display of negative numbers is quite geeky if you think about it.
It should ideally say “10s ago” and maybe change colour? I’m not sure if there is space to put “went off 10s ago”.
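
To make the display options concrete, here is a small sketch of how an expired timer could be rendered either way. The function name `format_timer` and the `style` values are made up for illustration and are not taken from the actual timer skill.

```python
def format_timer(remaining_seconds: int, style: str = "ago") -> str:
    """Render a timer value; negative values mean the timer has expired.

    `style` is a hypothetical switch between the two display ideas:
    "ago" gives "10s ago", anything else gives the signed "-10 s" form.
    """
    if remaining_seconds >= 0:
        minutes, seconds = divmod(remaining_seconds, 60)
        return f"{minutes}:{seconds:02d}"
    overdue = -remaining_seconds
    if style == "ago":
        return f"{overdue}s ago"
    return f"-{overdue} s"
```

The colour change would be handled by whatever draws the screen; this only covers the text.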

Time and date were fine.
If the music could be found “quickly”, skip the intro announcement.
Stopping the timer and music:
The timer should continue until verbally stopped (good).
The issue, though, is how should Mycroft interpret potentially ambiguous commands?
If the timer was still running and the music was still playing and the command was “Hey Mycroft Stop”, Mycroft would not know which skill to stop. Relying on what is displayed is not a great option, as it may be running on a display-less device or the display may not be visible for some reason. If it can’t reasonably be determined which skill the command is meant for, Mycroft could come back and say “Stop timer or stop music?” and wait for the response.

Regarding the discussion about using “Thank you Mycroft” to stop a response: it seems inconsistent with how Mycroft interactions work. Unless Mycroft asks for clarification, it is simplest if all commands start with “Hey Mycroft…”. In the case of a rambling response to a badly stated question, simply say “Hey Mycroft, stop”. In the middle of a response, there is little ambiguity as to what needs to be stopped.

Thanks for all the awesome feedback everyone!

It’s clear that there’s never going to be one set of behaviours that works for everyone. We’re all different and we have different expectations.

As Michael said - in the first round we want to implement good default behaviour. Something that works well for most people and can be manually modified if you choose, e.g. through a setting change. In time the system should learn different users’ preferences and adapt itself - but this is a much longer term ambition.

I won’t reply to everything directly but there are a few things I wanted to drill into a little more…

Stopping things

Some intents are nice and simple to infer - for “stop the music” you would expect to… stop the music. It gets less clear when looking at generic phrases like “stop”. If a timer or alarm is actively beeping then we might assume that this, rather than the music, is what should be stopped. But what happens if we have a Timer displayed on the screen and music playing in the background? Should we ask which thing to stop? Stop the most recently activated process (i.e. LIFO)? Have a defined priority order - Expired thing > active timer > music > other? Or something else?
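
As a rough sketch of how two of these options could be combined, the snippet below ranks candidates by the priority order above and uses LIFO only as a tie-breaker within one priority level. `Kind`, `Activity` and `pick_stop_target` are hypothetical names for illustration, not existing Mycroft code.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Kind(IntEnum):
    # Lower value = higher priority for a bare "stop" (assumed ordering).
    EXPIRED = 0       # a timer or alarm that is actively beeping
    ACTIVE_TIMER = 1  # a timer that is still counting down
    MUSIC = 2
    OTHER = 3


@dataclass
class Activity:
    name: str
    kind: Kind
    started_at: float  # any monotonically increasing timestamp


def pick_stop_target(activities: List[Activity]) -> Optional[Activity]:
    """Return the activity a bare "stop" should apply to, or None."""
    if not activities:
        return None
    # Priority first; within one priority level, most recently started wins.
    return min(activities, key=lambda a: (a.kind, -a.started_at))
```

Because the order lives in one place, it could be changed or made a user setting without touching individual skills.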

It is relatively easy to add more back and forth conversation and/or settings to handle situations like this - but the more we can infer these correctly and not need to bother the user with more detail, the better the experience.

As you said in a follow-up post, I can see myself saying “thank you Mycroft” when a Timer has expired or even if Mycroft is being a bit long-winded and I want to cut off the speech. But if I asked for some music, it started playing, and I said “thank you Mycroft” - the intent is not for the music to stop. Can you think of other examples where “thank you” is or isn’t a termination intent/command?

Undoing actions

This is a really interesting scenario both in terms of what “stop” should do but also around what actions should be “undo-able”? Would love to hear more thoughts or examples of things that are important to undo.

Expired Timer display

Definitely not a nitpick - this is great feedback! We want to hear it all - big and small.

Visual indicator of system state

Currently the LEDs on top of the unit perform this function but the intention is to have this reflected on the screen as well. You can see the start of this if you set your device to grab the “latest” updates via home.mycroft.ai. Are there other states or information that you think could be communicated through this other than “ready”, “actively listening”, or “processing command”?

Stopping: it’s indeed nice if Mycroft could infer what you’re trying to stop, rather than (always) add questions. But note that there are options in between as well:

  1. it could infer, yet you’d still have the option to say “that’s not what I meant” (or some command like that), to nip it in the bud
  2. it could infer, and ask “OK?” (when in doubt) to get explicit confirmation
  3. only if pretty clueless it would indeed have to just ask what to stop
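
The three options above could be tied to how confident Mycroft is in its guess. The sketch below assumes each candidate gets a confidence score between 0 and 1; the thresholds and the name `plan_stop` are invented for illustration.

```python
from typing import Dict, Tuple

ACT_THRESHOLD = 0.9      # assumed value: confident enough to just act
CONFIRM_THRESHOLD = 0.6  # assumed value: confident enough to propose a guess


def plan_stop(scores: Dict[str, float]) -> Tuple[str, str]:
    """Map per-candidate confidence scores to (strategy, target or prompt)."""
    if not scores:
        return ("nothing", "")
    best, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence >= ACT_THRESHOLD:
        # Option 1: act, the user can still say "that's not what I meant".
        return ("act", best)
    if confidence >= CONFIRM_THRESHOLD:
        # Option 2: infer, but ask for explicit confirmation.
        return ("confirm", f"Stop the {best}, OK?")
    # Option 3: pretty clueless, so ask what to stop.
    options = " or ".join(sorted(scores))
    return ("ask", f"Stop {options}?")
```

For example, `plan_stop({"timer": 0.7, "music": 0.5})` would propose the timer and wait for an “OK”.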

Thank you: I think that’ll do wonders if it effectively means “shut up”, but it shouldn’t extend beyond that; anything more gets way too subtle, and therefore confusing.

That also means the actual “stop” could be “stop action” (not just stop confirming). If it’s confirming that 10 pizza order, I suppose there might be a lot of text, like total price, when they’ll be there, etc. If that order is somehow a mistake and I want to stop it going through, I do not need further confusion around the commands…

Quick to glance: Please don’t bury information in extensive text (hope that simply won’t fit). I’ve been surprised while visiting the USA how much text there is in traffic everywhere! In Europe, we use road signs for that. IMHO that’s much more readable at a glance and less dependent on knowing a language well.

It might be a tad geeky, but a “-X s” (in red?) is also very clear and remains practical for those who use timers all day long, as it is concise and quick to read. That also means it is easy to read for kids who don’t read that well yet, anyone who should wear glasses for proper reading but is still at breakfast, folks who don’t speak the language very well, etc.

(Anyone confused the first time will see what it means anyway by staring at it for literally an extra second - “Oh, it goes UP again…”)

Processing command: Sounds like a really good set of three simple indicators, that just didn’t come across yet over this video medium.

Good discussion around LIFO/stack of context vs. specifying context. It seems best to me that Mycroft handle both.

My remark is actually mainly about the video content. For completeness, the video could anticipate the variety and demonstrate both. So perhaps after the demo as is, the user then pauses a moment and starts a different song, and starts another timer, and this time at expiration, utters “Mycroft, stop music” and then a little pause and then “Mycroft, stop timer”. And finally, user then asks Mycroft how long the timer ran, to which Mycroft replies “five minutes and twenty seconds.” (Adding the absolute value of the time overage to original duration).

Concerning “undo”, I think actions that do one of the following should be undo-able:

  • result in immediate or future payments (to protect from financial loss)
  • result in configuration changes (to avoid misconfigs that drive users crazy, e.g., changing the language to something you do not speak → requires a universal keyword for undo)
  • control events for Internet of Things devices (to mitigate damage or other harmful/unwanted results)

These would be my top priorities. The second one is universal for all skills, while the first and third concern specific skill types only (but these are the really useful ones, I guess).
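
One way to picture this: skills tag each action with its kind of effect, and anything in the three categories above must come with an undo handler before it is accepted. This is only a sketch; `Effect`, `Action` and `ActionLog` are hypothetical names, not part of Mycroft.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional


class Effect(Enum):
    PAYMENT = auto()
    CONFIG_CHANGE = auto()
    IOT_CONTROL = auto()
    NONE = auto()


# The three categories from the list above.
MUST_BE_UNDOABLE = {Effect.PAYMENT, Effect.CONFIG_CHANGE, Effect.IOT_CONTROL}


@dataclass
class Action:
    description: str
    effect: Effect
    undo: Optional[Callable[[], None]] = None


class ActionLog:
    def __init__(self) -> None:
        self._done: List[Action] = []

    def record(self, action: Action) -> None:
        if action.effect in MUST_BE_UNDOABLE and action.undo is None:
            raise ValueError(f"{action.description!r} needs an undo handler")
        self._done.append(action)

    def undo_last(self) -> Optional[str]:
        """Undo the most recent undoable action and return its description.

        Newer actions without an undo handler are dropped from the log.
        """
        while self._done:
            action = self._done.pop()
            if action.undo is not None:
                action.undo()
                return action.description
        return None
```

A universal “undo” keyword would then only need to call `undo_last()`, whichever skill made the change.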

It may be helpful to see this as another instance of skill-based intent recognition, giving skills a chance to handle things first.

For example, the timer skill may be able to handle “stop timer {name}” to stop a timer by name. It may also handle just “stop”, but this should trigger a fallback if no timer exists. Similarly, the music skill could handle “stop music” or just “stop” (trigger fallback if no song is playing).

If the skills “fall back” in LIFO order of use, it would produce the expected behavior.
With a song playing, then a timer set and beeping:

  1. “stop” will disengage the current timer
  2. “stop” again will pause the music

The benefit of this approach is you could still say “stop music” to leave the timer alone. Additionally, Mycroft itself may have a meta skill that can handle phrases like “stop everything”.
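
Here is a minimal sketch of that fallback chain, with made-up `Skill` and `Dispatcher` classes standing in for real skills and intent handling (the actual Mycroft skill API is not used here):

```python
from typing import List


class Skill:
    """Hypothetical stand-in for a skill that may or may not handle "stop"."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active = False

    def handle(self, utterance: str) -> bool:
        """Return True if handled, False to fall back to the next skill."""
        targeted = utterance == f"stop {self.name}"
        if self.active and (targeted or utterance == "stop"):
            self.active = False
            return True
        return False


class Dispatcher:
    def __init__(self) -> None:
        self._stack: List[Skill] = []  # most recently used skill is last

    def use(self, skill: Skill) -> None:
        if skill in self._stack:
            self._stack.remove(skill)
        self._stack.append(skill)
        skill.active = True

    def dispatch(self, utterance: str) -> str:
        """Return the name(s) of whatever was stopped, or "nothing"."""
        if utterance == "stop everything":  # the "meta skill" case
            stopped = [s.name for s in self._stack if s.active]
            for skill in self._stack:
                skill.active = False
            return ", ".join(stopped) or "nothing"
        for skill in reversed(self._stack):  # LIFO order of use
            if skill.handle(utterance):
                return skill.name
        return "nothing"
```

With music started first and a timer second, two bare “stop” commands stop the timer and then the music, while “stop music” leaves the timer alone.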

The commentary around the word “stop” actually reminds me of a surprising interaction I had with a proprietary home-assistant (HA) device while at a friend’s place.

My friend has much of his A/V gear hooked up to his HA device, so we were able to tell it to turn on the projector and start streaming a movie to it. At some point, we put some pizzas in the oven and asked the HA device to set a timer to tell us when they were cooked, and resumed watching the movie.

At some point during the movie something interesting happened, so we wanted to pause the movie to discuss it, but our attempt to pause the movie didn’t work. Puzzled, we tried again; it worked the second time.

After several minutes discussing whatever plot point had taken our interest, we resumed the movie. Some time later, we noticed a burning smell; we jumped up and pulled the pizzas out just in time! They were extra toasted around the edge, but still edible.

I’m sure you can tell what happened: our first attempt to pause the movie actually paused our cooking timer!

In this case, I don’t think his HA device had a screen, so it could be difficult for the device to choose what we wanted to pause. But even a display might not help; the HA device may not be able to determine what it is we are focused on and therefore what should be paused.

Perhaps an interaction like this would have helped:

  • Me: “pause”
  • HA: “the movie, or the timer?”

I have no idea what you would do if there were multiple timers, either with or without a movie or background music… A conundrum to be sure!

PS: if the HA device did choose one thing when there were multiple candidates, perhaps some unambiguous feedback would help; for example, the situation above would have been easy for us to resolve if my friend’s HA device told us “I’ve paused your timer”. I don’t recall there being any feedback at all, or if there was it was drowned out by the movie…

I don’t do CrossFit anymore, but when I did the following would have been great.
An example might be to have Mycroft start a workout music playlist, start a timer for 3 minutes of, say, jumping rope with a 5 sec countdown and an alarm at the end of the 3 min, then 1 minute of rest with a 5 sec countdown and an alarm at the end of the minute, then 2 minutes of burpees with countdown and alarm, and repeat this a total of five times.
“Mycroft start a workout with music, exercise announcements, countdown timers and alarms.”
“Ok. What music do you want?”
“World’s best workout playlist”
“Ok. What is your first exercise?”
“Jumping rope”
“Ok. How long?”
“3 minutes with 5 second count down at end”
“Got it. Next exercise (and time with 5 sec count down)?”
“Rest (for 1 minute with 5 sec count down)”
“Roger that. Next exercise (and time)?”
“Burpees for 3 minutes with 5 sec countdown”
“Ok. Next exercise?”
“Repeat that 5 times and stop all but let the music play”

That would have been really nice to have.
If Mycroft called out split times every minute (“2 minutes left”), and every 30 sec in the last minute, that would have been really nice.
If Mycroft could be connected to a heart rate monitor it could call out/display your heart rate occasionally.
If you count your reps, Mycroft could record your count, and at the end tell you how many you did, where your heart rate was, and what your heart rate did while you recovered after the workout.
Or maybe there is an app I don’t know about that does all this, and you just say “Mycroft, run world’s best workout app”

If you like your workout, you can name it and save it and not have to recreate it. Share workouts with friends. Work out together at the same time (option for separate playlists) or asynchronously.
Also Mycroft could save results of workouts if you count reps out loud - first round: 30 pushups in 1 min, heart rate X to X during 1 minute rest, 20 burpees in 1 minute, heart rate …; second round: pushups, heart rate, burpees, heart rate; 3rd round, etc. Look for progress over time, compete with friends.

Could Mycroft sync to multiple headphones and play different playlists on the different headphones so people can have their own tunes?

I did not expect the year when asking for the day.

I expected the music to just play, not to receive a confirmation.

When the timer was being set, I did not expect the music to disappear; I expected it to continue with a lowered volume. Like when I’m listening to music on my phone, the sound is just lowered to signal a notification.

For the stop command, I expected a question: “Stop what?” or “The music or the timer?”

Thanks for the extra feedback all - it’s all very helpful in determining the behaviour we should expect. This will be used to inform the design of the technical processes sitting behind the scenes.

I’ve just posted another video and would love your feedback on this video too.

Thanks!

Hey all, sorry for the slow response here, I’ve been out of office for a while. These are all great suggestions that we will consider in the Skills Interaction Sprint.

In the disambiguation example that @plmorel described we are considering making some assumptions when you say “stop.” For example, if a timer or alarm is in an expired state we think it’s safe to assume the user wants to stop the beeping timer or alarm and not the music. The beeping timer or alarm at this point would be in the perceived foreground.

@Msquared also mentioned a problem that would have benefited from disambiguation, the movie vs timer scenario. It probably would have been OK if the Assistant had ducked the audio and responded with “I’ve paused the Pizza timer.” The Assistant is making an assumption, and in this case the wrong one, but at least you know what happened. I think giving feedback when things are ambiguous gives the assistant some more agency to “guess.”

I’m not saying these two solutions are THE solution, but we do want to minimize disambiguation as much as possible if we are HIGHLY confident we have the right answer.

Also everyone should keep in mind that a lot of the solutions we will be working on in the Skills Interaction Sprint will be the mechanisms to allow these interactions to happen. If we do it correctly we can change priority, or add disambiguation, etc… to react to user feedback. Right now we need the system to be aware of these types of clashes.

Tell Mycroft to “Silence all” or “Silence everything” for just that.
Tell Mycroft to “Silence all/everything except/but …” and list the audible skill(s) that should continue; all else is stopped.
Any skills (the timer continuing to count down, cooking breakfast, returning Pluto to its rightful planetary status, etc) that are not being done through the speaker would continue.
So “silence” may work better than “stop”. Just need to train the human.

The user may want silence to hear Mycroft’s timer when it goes off. Or the user may want complete silence and NOT hear the timer. That’s a (rare?) context-specific situation. If it’s a repeating situation there should be a way for the user to specify audible timer alarm vs silence. Not sure what the initial default would be.

Mycroft may “introduce” itself to any new user as a helpful machine/gadget and go over things like “stop all” vs stop specific skill(s) vs “silence all/everything” vs silence all but specific noise producing skill(s). Either one long introduction or many smaller ones depending on user preference. May need optional reminders when requested.

Speaking of user preference, and just for me: I don’t want to tell Mycroft “thank you” for anything. It’s going to be a helpful machine but it’s obviously not human. Even though it is obviously not human, it’s obvious that humans can treat/think about machines as human. I don’t want to ever start blurring that line between machines and humans. I don’t want other people to blur it either, but it’s their call if they want to or just don’t care.

I would also like to rename my personal Mycroft to HAL (yep, from 2001: A Space Odyssey) to help me remember that. I’d actually prefer to rename my computer “computer” or “machine” (more neutral and less silly than HAL) but that would lead to too many inadvertent wake-ups. I want Mycroft to help me be better to humans, and not treating/thinking of Mycroft as human is a good place to start.

Mycroft to me: “Hey moron, so-and-so’s birthday is next week. Try to do something human. Don’t look at me. Can’t help you with that.” Most people absolutely do not want this particular feature. Just trying to make a point.

Does Mycroft’s own speaker (depending on volume) interfere with Mycroft’s ability to hear commands?

@Msquared’s scenario is really interesting. It seems like Mycroft might benefit from some additional categorical metadata, such as marking active skills that have a “duration” quality to them (timers or audio streams, for example). Mycroft could then evaluate whether or not there was a plurality of things that could be “stopped” on command. This would also allow skill authors who come up with new and creative things that don’t have an implicit or obvious “duration” quality to explicitly make Mycroft aware of it (Do Not Disturb mode, perhaps?).
Another complication: some kinds of skills, like media playing (indeed probably as was the case in @Msquared’s example), are implemented in a “stateless” way (kind of like HTTP) where “PLAY” and “PAUSE” are point-in-time commands, not ongoing states/connections. So Mycroft (and/or the skill authors) may need to include some kind of “state stack” so he could keep track of whether a movie was (probably) still playing, seeing as how the most recent command he had issued was a “PLAY”.
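
To illustrate both ideas together, here is a sketch where each skill declares a “duration” flag and a small tracker remembers the last command sent to it, so Mycroft can guess what is probably still running. All names (`ActiveSkill`, `StateTracker`, `stoppable`) and the command strings other than “PLAY” and “PAUSE” are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Commands assumed to leave something running; "START" is hypothetical.
RUNNING_COMMANDS = {"PLAY", "START"}


@dataclass
class ActiveSkill:
    name: str
    has_duration: bool  # explicit metadata: is this something one can "stop"?


class StateTracker:
    """Remembers the last point-in-time command sent to each skill."""

    def __init__(self) -> None:
        self._last_command: Dict[str, str] = {}

    def sent(self, skill: str, command: str) -> None:
        self._last_command[skill] = command

    def probably_running(self, skill: str) -> bool:
        # A guess only: the device may have stopped on its own since then.
        return self._last_command.get(skill) in RUNNING_COMMANDS


def stoppable(skills: List[ActiveSkill], tracker: StateTracker) -> List[str]:
    """Names of skills a bare "stop" or "pause" could plausibly refer to."""
    return [
        s.name
        for s in skills
        if s.has_duration and tracker.probably_running(s.name)
    ]
```

If `stoppable(...)` returns more than one name, as with the movie and the pizza timer, that is the cue to disambiguate or at least say which one was paused.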

This is largely hardware dependent. So on the Mark II we have Acoustic Echo Cancellation which essentially subtracts the audio output from the audio input meaning you can speak over the top of it very easily. On the Mark 1, desktop or Picroft installs (depending on the hardware) this isn’t the case so you really have to yell to be heard.

The other way to overcome this is hardware buttons. We’ve got those on the Mark 1 and Mark II, and on desktop you can set up a key combination. Hitting this acts like a wake-word trigger, letting you issue a voice command without saying “hey mycroft”.

This may have already been suggested but… if Mycroft is playing music and another request is initiated, the music could be muted and continue playing instead of pausing.

Hey Mark, I don’t know whether that has come up, but I can totally see that behavior happening too. Seems like it might depend on what type of media it is, e.g. music vs a podcast.