Initial thoughts and a few questions

Hi everyone, I just thought I’d share my initial experiences with Mycroft and hopefully get some up-to-date answers while I’m at it.

I bought a Pi 4B, a Blue Snowball mic and some Creative Pebble 2 speakers. Installing/imaging Picroft was easy with Etcher. Booted up; HDMI output didn’t work, so I checked my router for the DHCP-allocated IP address and SSH’d in. No problems so far other than having to wait a while until the initial boot had finished (before I could connect) with no way of knowing it was done (I’ve since installed the Finished Booting skill).

The first thing I did was change the robotic British male voice to the much better American male voice. I’m British so I would have preferred a Brit, but the difference in quality made this a no-brainer (and I understand the local vs cloud processing reasons). But here’s my first question. I read that by subscribing to Mycroft you get better voices using Mimic2, but my American voice says it’s already using that. Another post says to enable the American male beta voice while it’s still free. What’s the state of play with this as of right now? Are there actually more/better paid-for voices?

Then I followed the instructions to set up a wireless connection and rebooted. All fine, but I feel like maybe the wireless credentials could be added on Mycroft Home for people less comfortable with the Linux command line.

I added/configured my Home Assistant server and said “Hey Mycroft, turn on the hot water”. Boom. Hot water turned on. What a great start. Then I realised that was about all I could do with Home Assistant (unless I’m missing something), I couldn’t use its existing connections to my Sonos speakers, heating/climate controls, alarm system, cameras etc.

I wasn’t expecting to be able to say “Hey Mycroft, play me some Elvis on Sonos in the sitting room” right away, but that was certainly a goal. “Hey Mycroft, boost the heating” was another.

So next step, hook up a music service. I generally use YouTube Music (forced “upgrade” from Google Play Music), but that didn’t seem to be available, which is OK - I have Spotify Premium as well. That skill seems to have OAuth problems and can’t authorise to Spotify. I saw the thread about a workaround which I will try later, but right now I just wanted some music playing. I have Amazon Prime too so I tried the Amazon Music skill but again, that doesn’t work anymore. I was running out of options. I managed to link it to my local Emby server (I had to force using a password on the local network or it wouldn’t authorise, but at least it worked) and finally got some music playing. That’s fine if I want to listen to any one of thousands of my mp3s from the 1990s, but the availability of streaming means that my local song collection is hideously out of date.

I feel a basic step like linking a music service should be easier than this!

I’d be happy to contribute to skills but my python is practically non-existent. If they were written in PHP I’d have churned out a bunch of pull-requests by now. Guess I’d better start learning python…

One final question - the whole process of answering questions or commands seems quite slow. I see the text output of the response and then a few seconds later it’s synthesised into speech. Is that a normal delay? Would something stronger than a Pi 4B make a difference or is that because I’m using a voice that’s coming from the Mycroft servers? Can I replicate that system locally for a faster spoken response?

Thanks for building a great tool though - I don’t mean to be critical, I was just quite surprised that to get things to work you have to be pretty technical and have time to fiddle (I am/have both), but of course that’s what happens with free/open-source/community-driven software. Keep up the good work and I will start swotting up on Python!

Hi Misha,

Welcome :slight_smile: to the community! I joined a few weeks ago, and I hear you on the issues you’ve run into; things could certainly be improved to ease the Mycroft journey, and it’s feedback like this that gets that started.

I was really annoyed by the Spotify skill issue too, which I know is not a Mycroft community issue but a decision from Spotify to shut off API access for this voice assistant. I tried the workaround; it’s not very user friendly, but in the end it worked.

About Sonos, there are a couple of skills available:

  1. https://github.com/lnguyenh/spotify-sonos-bot-skill which only works with one speaker and requires installing Node.js and Spotify
  2. https://github.com/boxledev/sonos-controller which only works with one speaker and only with a local library.

I’m currently building a Sonos skill[1] (yeah, another one…) which will work with any Sonos speakers you have and with different services (Spotify, Amazon Music, local library, etc.).
This skill will be based on the SoCo Python library, which connects directly to the Sonos speakers to retrieve the tokens for the different registered services; they are trying to fix[2] the issue with the music services.
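If you’re curious what talking to a Sonos speaker through SoCo looks like, here’s a minimal sketch (assuming SoCo is installed with `pip install soco`; the room name and stream URI are just placeholders for your own setup):

```python
import soco

# Discover every Sonos zone on the local network (returns a set of SoCo objects, or None)
speakers = soco.discover()

# Pick a zone by its room name - "Sitting Room" is a placeholder
sitting_room = next(s for s in speakers if s.player_name == "Sitting Room")

# Basic transport control is the same regardless of which music service is behind it
sitting_room.volume = 25
sitting_room.play_uri("http://example.com/stream.mp3")  # placeholder URI
print(sitting_room.get_current_track_info()["title"])
```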

There are multiple ways to improve the response time. For me, the following improvements are what made me stay with Mycroft.

  1. Add tsched=0 to module-udev-detect in /etc/pulse/default.pa, which greatly improved wake word detection (a reboot was required; a process reload/restart didn’t work).
  2. Reduce the "recording_timeout_with_silence" timeout from 3.0 to 1.0 seconds, which speeds up requests (see the config sketch after this list).
  3. Using Google TTS as the voice was a speed gain too.
  4. Use the Cloudflare DNS resolver (1.1.1.1), which has a pretty good response time.
  5. Make sure your Raspberry Pi has a good Wi-Fi connection.
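For reference, the listener timeout from item 2 and the TTS module from item 3 both live in the user configuration, typically ~/.mycroft/mycroft.conf on Picroft. A sketch of the relevant keys - the values are just what I use rather than official recommendations, and the Mycroft services need a restart to pick them up:

```json
{
  "listener": {
    "recording_timeout_with_silence": 1.0
  },
  "tts": {
    "module": "google"
  }
}
```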

[1]https://github.com/smartgic/mycroft-sonos-controller-skill
[2]https://github.com/SoCo/SoCo/pull/763

I hope it helps.


You seem to have the right sense of where Mycroft is - intermediate-phase small-company-plus-FOSS project - but I think the more relevant problem is illustrated right here.

That is, there are a lot more moving parts here than it might seem at first glance. First, a framework that responds to “play” - because lots of skills might be able to “play” - will go find the skill that’s most confident it can play “me some Elvis.” But, wait, what about those extraneous words? What if there’s a better-matching skill, but it would only respond to “Elvis”? What if a skill wants to disambiguate between Elvises Presley and Costello?
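That negotiation is roughly what the CommonPlay framework in mycroft-core does: every playback skill gets asked about the phrase, reports how confident it is, and only the winner actually plays. A minimal sketch of the skill side, with the catalogue lookup invented purely for illustration:

```python
from mycroft.skills.common_play_skill import CommonPlaySkill, CPSMatchLevel


class ExampleMusicSkill(CommonPlaySkill):
    def search_my_catalogue(self, phrase):
        # Placeholder lookup - a real skill would query its music service here
        if "elvis" in phrase.lower():
            return {"type": "artist", "uri": "file:///home/pi/music/elvis.mp3"}
        return None

    def CPS_match_query_phrase(self, phrase):
        # Called for every "play ..." utterance; return None if this skill can't handle it
        result = self.search_my_catalogue(phrase)
        if result is None:
            return None
        # Report a confidence level so the framework can pick between competing skills
        level = CPSMatchLevel.ARTIST if result["type"] == "artist" else CPSMatchLevel.GENERIC
        return phrase, level, {"uri": result["uri"]}

    def CPS_start(self, phrase, data):
        # Only the winning skill receives this call, with the data it returned above
        self.CPS_play(data["uri"])
```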

But say it comes across clean, and the best-choice skill says “I can play you some Elvis.” Now it needs to do a whole separate thing with Sonos.

And there are implications here, as well. Worth mentioning that if it tried to find a skill to play “me some Elvis on Sonos in the sitting room” that’s just going to fail, because no catalog will find that.

But, if the correct granularity is accomplished - and I don’t think this exists right now, but I could be entirely wrong because I’ve never used any of the relevant skills - the “play” framework needs to feed the remainder of the input back into the intent parsers. That is, it needs to retain, “Spotify is ready to play Elvis,” and, having correctly sliced off the rest, feed “on Sonos in the sitting room” back into the works.

Then the Sonos skill has to say, “I can do that!” Except its intent wasn’t fed any “play …”; that information is being retained elsewhere. What needs to happen now is:

  • The Sonos skill needs to open a new audio source for playback, or the Spotify skill needs to do it and then communicate back
  • The Sonos skill needs to set that source’s output to whatever it uses to route audio, and route that source to your sitting room
  • The Spotify skill has to switch to that output
  • The Spotify skill starts playback
  • Mycroft needs to know that two possible meanings of “stop” or “stop playback” are “[pause/terminate] the running Spotify connection [then close the corresponding Sonos connection and end process]”

No small feat. It’ll get there, but wow that’s a lot of moving parts and complex evaluations.


It actually just occurred to me that there’s even more to it! Let’s whittle the utterance down.

“Play x in y” - is that a location, or an application, or is “in” part of the name of the thing to play? Heck, you might say, “Play Star Trek picture in picture.” Now we’re off to the races. What’s “Star Trek picture” and what application can play it “in picture?” Oh, “picture in picture” is a thing that your <desktop skill/BigScreen/whatever> can do. Okay. Which Star Trek? WHOA there are over 100 episodes of that! Which one do you want to watch, or should we just pick one? Or maybe you expect a particular streaming service to pick up where you left off. You didn’t specify. The skill may or may not ask you to clarify, depending whether it thinks that’s what you want it to do.

“Play x on y” - Same basic problem. Is y a service or a device?

“Play x at y” - Location or volume?

“Start x with y” - This doesn’t even have to be playback.

Indeed, “play” could mean “open this video game!”

Common frameworks for skills like these will interrogate the compatible skills, which will estimate their confidence that you meant to invoke that skill. Each skill needs to account for whichever of those possibilities apply. This stuff is hard!

Thank you goldyfruit - I’ve already installed Sonos Controller and it works well although it could do with a few tweaks. I’ll look out for your skill in the future. Thanks also for the speed suggestions, I will try them out tomorrow!

Thanks for your reply ChanceNCounter, I wasn’t really expecting it to know that I prefer Presley to Costello :smiley:

I agree there are a lot of moving parts. But it doesn’t have to be quite so complicated when you break it down. Play [song/album/artist] on Sonos in [location].

The command starts with “Play”, so we know it’s music or a video (or possibly a game?). Then scan the whole query for a few keywords. Once we’ve found the phrase “on Sonos”, that can be used as a delimiter: the part between “Play” and “on Sonos” must be something to do with a music request (since that’s what Sonos is used for), and the only thing allowed after the phrase “on Sonos” should be a location/speaker name.
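A rough sketch of that delimiter idea in plain Python - nothing Mycroft-specific, just to show how little slicing the happy path needs, assuming the utterance really does follow the “Play … on Sonos in …” shape:

```python
import re

PLAY_ON_SONOS = re.compile(
    r"^play\s+(?P<query>.+?)\s+on sonos(?:\s+in\s+(?P<location>.+))?$",
    re.IGNORECASE,
)

def parse_play_request(utterance):
    """Split 'Play <query> on Sonos [in <location>]' into its parts, or return None."""
    match = PLAY_ON_SONOS.match(utterance.strip())
    if not match:
        return None
    return {"query": match.group("query"), "location": match.group("location")}

print(parse_play_request("Play me some Elvis on Sonos in the sitting room"))
# -> {'query': 'me some Elvis', 'location': 'the sitting room'}
```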

The tricky part then is only deciding which music service to send the music query to; being able to define a default or a search order would help.

It’s obviously possible because Amazon/Google have pretty much mastered it (admittedly with FAR more money/time/people/testing).

I feel a bit sad though because I always tell people not to use Alexa etc., but at this point I couldn’t possibly recommend Mycroft to anyone - well, to any “normal”, non-technical people - the norms. I’d love it to get to a more user-friendly stage and will definitely be contributing to skills to get there in the future…

Did anyone know about the subscription voices by the way?

I’m a subscribed user (Yearly Membership) and I didn’t see any difference; maybe I’m missing something. I subscribed to help the project - I didn’t know we had any advantages :crazy_face:

I have built a USB music skill that will play music from a USB thumb drive, a network share or a local path if you have your own collection. There is a link in the skills area of the forum. Cheers.

@Misha What about a location without a service?

“Play Elvis in the sitting room” is pretty straightforward. A skill knows how to “play in the sitting room.”

“Play Alice in Chains.” What do?

The challenge here is identifying which slices to operate on, and then which slices get priority, and the chaining of intents that need to register themselves in various ways with a framework that needs to reconcile them.

This might be an argument in favor of some kind of knowledge graph like Wikidata. Find things in the utterance that are Things, find the kind of Thing, go from there.

"‘Play {Elvis’}, 99% confidence: ‘Elvis Presley’, has attribute: ‘musician’, ‘Play {musician: Elvis}’

“‘Play {Elvis in’}, nothing, split utterance”
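The lookup itself is the easy part; here’s a minimal sketch against Wikidata’s public wbsearchentities endpoint (the comments about what comes back are just my expectation of typical results):

```python
import requests

def wikidata_candidates(term):
    """Search Wikidata for a term and return (label, description) candidate pairs."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": term,
            "language": "en",
            "format": "json",
            "limit": 5,
        },
        timeout=10,
    )
    resp.raise_for_status()
    # Each hit carries a label and a short description like "American singer (1935-1977)"
    return [(hit.get("label", hit.get("id")), hit.get("description", ""))
            for hit in resp.json()["search"]]

print(wikidata_candidates("Elvis"))     # Elvis Presley is typically the top hit
print(wikidata_candidates("Elvis in"))  # far weaker matches - a hint to split the utterance
```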

Of course, this still only gets you partway there. It would straightforwardly chunk the utterance (most of the time) to facilitate confidence on the next pass, and ease the burden on skills to parse a whole utterance. You’d still have a bunch to do.

Pass #1, resulting normalized utterance: “Play {musician: Elvis Presley} | in sitting room”
Pass #2, resulting sequence of possible intents: “{Padatious -> Spotify: play Elvis Presley} | in sitting room”
Pass #3: “{Padatious -> Spotify: play Elvis Presley} | {Padatious: ‘Play * in sitting room’ -> Sonos}”
Sonos to intent parser: wait for ready
Sonos: open audio out to “sitting room”
Sonos to intent parser: ready: audio device
Intent parser to Spotify: "Play Elvis Presley using audio device "

And then you’ve gotta register it as a thing that’s running, so Mycroft can pause it, and stop it, and disambiguate with other things that are playing stuff…

Parsers are hard, mang. I’d much rather work on Lingua Franca, where somebody already wrote the spec hundreds of years ago =P