Girlfriend can't wake Mycroft

Hi all,

So for some reason, my girlfriend can’t wake Mycroft. I’m running the latest picroft image, with the following microphone:

Jabra Speak 410 Corded Speakerphone (USB): https://www.amazon.com/dp/B007SHJIO2/ref=cm_sw_r_apanp_FE96RRKqzPI49

Seems to work fine for me. Haven’t had time to check the CLI logs when she’s trying to wake it yet. I’ll add those as soon as I can. Mainly wondering if anyone else has experienced a similar situation. She’s even tried yelling with no luck.

2 Likes

My girlfriend’s voice doesn’t work either. Could it be a matter of vocal timbre?

Since I wanted to use another wake word anyway (i.e. “Hey Doc”), it wasn’t a problem for me.

If you want to (and can) use another wake word, check this link: Using a Custom Wake Word - Mycroft AI

There is also a repo with other wake words already trained: GitHub - MycroftAI/Precise-Community-Data: Pre-trained Precise models and training data provided by the Mycroft Community

1 Like

We really need more data for those to be generally useful.

Ideally, a wake word portal that collects and validates community wake words should be created. It’s not a priority for the team, of course, but hopefully someone is able to make one soon so everyone can benefit.

2 Likes

Hi all,
I had exactly the same problem when I started out with Mycroft: the women in my family couldn’t activate Mycroft reliably.
Changing the wake word model to this beta model (precise-data/hey-mycroft-001 at production_models · MycroftAI/precise-data · GitHub) greatly improved the wake word activation rate for me. Give it a try; maybe it also improves your situation.
Use the tutorial posted by @ale: Using a Custom Wake Word - Mycroft AI
For me, a sensitivity of 0.3 and a trigger level of 6 have worked very well.
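
For reference, the matching bit of mycroft.conf should look roughly like this (the model path is just an example; point local_model_file at wherever you saved the beta .pb):

{
  "hotwords": {
    "hey mycroft": {
      "module": "precise",
      "local_model_file": "/home/pi/.mycroft/precise/hey-mycroft.pb",
      "sensitivity": 0.3,
      "trigger_level": 6
    }
  }
}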

Hope this helps!

3 Likes

Excellent…!
It should now activate reliably.

1 Like

So, just to report in after some testing:

I’ve made the adjustments that were suggested, and it has helped some. Mycroft will now respond to her… but only if she does a bad English accent, lol. It doesn’t generate any logs that I’ve been able to find when it fails to activate, and it will sometimes wake at odd moments.

Same issue here; I’ve noticed this with all the women in my family. @gustavmustermann’s solution helped a bit. I have Mycroft running on an RPi 4 (custom Debian install, not Picroft) as well as on an Odroid C4, both with Jabra 410 mic/speakers.

As @baconator mentioned, a portal to collect wake words and expand/improve the models would likely help. It should be an opt-in service with an enable/disable switch or voice command, an option to enable it manually or only for the next n minutes/hours, and a clear statement of what the collection is for.

2 Likes

Another +1. My wife couldn’t wake Mycroft without pitching her voice down to imitate a man. I’ll try out the beta, but I wanted to chime in and say I hope that when Mycroft 2 ships they’re not planning on leaving out half the population of the world :sweat_smile:

I’ve had excellent success with training a custom model. (And shameless plug here for some guides I wrote to help you do that!) The biggest hurdle was convincing her to sit down and say the wake-word over and over again, which she does not really see the point of.

Thing is, systems like Alexa or Siri have terrifically gigantic userbases that are constantly being fed massive, stupefying amounts of data from across the voice spectrum, whereas projects like this one tend to - when it comes to human-machine interaction, anyway - be inherently biased toward the userbase that creates them. This makes sense and isn’t a moral judgement - bias happens in all systems - but I’ve wondered about how we could encourage a large amount of people from across the voice / accent spectrum to contribute samples. It’s a large undertaking…and it depends on what you’re trying to train the model to do.

I don’t understand the internals of how Precise works, so I don’t know if it would be better, say, to have models trained to hear one wake word across a variety of accents and voices, or if there’s a point (and there almost certainly is) where the variation in pronunciation starts to make the model less, well… precise. (Pun intended.)

I’m sort of babbling here, and maybe it doesn’t matter so much for pitch vs pronunciation, but all this to say, at the moment, training a custom model may be the right way to go for you.

2 Likes

There is the ongoing MLCommons effort, where they are trawling ASR data and extracting words; the growing dataset is here:
Multilingual Spoken Words | MLCommons
I am staying clear of it for now. My initial reaction was “WTF”: with so many big names involved, why does it contain so many errors and such a poor distribution, and why have they padded each word into a 1-second silence WAV, so that most of the download is silence? Silence is much easier to add than to take away.
It’s still a great resource, and hopefully it will just get better.
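
If the padding bothers you too, stripping it back out is quick. A rough sketch, assuming you have already converted the clips to 16 kHz WAV (librosa and soundfile are my own choices here, not part of the MLCommons tooling):

import glob
import librosa
import soundfile as sf

# Trim leading/trailing silence from each padded 1-second clip, in place.
for path in glob.glob("mswc_wavs/**/*.wav", recursive=True):
    audio, sr = librosa.load(path, sr=16000)
    trimmed, _ = librosa.effects.trim(audio, top_db=30)  # anything 30 dB below peak counts as silence
    sf.write(path, trimmed, sr)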

If you try this: https://drive.google.com/file/d/1EFT4T0sxyVo9EXWMh-V0BL4QWXAFfVlE/view?usp=sharing

It’s a simple Python script that runs a TFLite KWS model and captures the KW audio on each hit. To be honest, on the privacy side, as long as it’s opt-in rather than opt-out like big data, it confuses me why KW audio isn’t captured and shared more widely.
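
For anyone who can’t grab the file, the core of that kind of script is roughly this. A minimal sketch only: the model filename is a placeholder, and the feature settings and input shape depend entirely on how the model was trained (sounddevice, python_speech_features and tflite_runtime are assumed):

import numpy as np
import sounddevice as sd
import soundfile as sf
from python_speech_features import mfcc
from tflite_runtime.interpreter import Interpreter

SAMPLE_RATE = 16000
THRESHOLD = 0.9  # detection probability above which we call it a hit

interpreter = Interpreter(model_path="hey_marvin.tflite")  # placeholder model file
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

hits = 0
while True:
    # Record one second of mono audio from the default capture device.
    # (A real script would use a streaming callback rather than back-to-back 1 s blocks.)
    audio = sd.rec(SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    audio = audio.flatten()
    print("max level:", np.abs(audio).max())  # feedback on what the KWS actually receives

    # Convert to an MFCC "image" and reshape to whatever the model expects.
    features = mfcc(audio, samplerate=SAMPLE_RATE, winlen=0.025, winstep=0.01, numcep=13)
    features = features.astype(np.float32).reshape(input_details[0]["shape"])

    interpreter.set_tensor(input_details[0]["index"], features)
    interpreter.invoke()
    score = interpreter.get_tensor(output_details[0]["index"]).max()

    if score > THRESHOLD:
        hits += 1
        sf.write(f"kw_hit_{hits}.wav", audio, SAMPLE_RATE)  # keep the clip that triggered
        print("KW hit", hits, "score", score)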

I have noticed that a lot of problems are due to audio: the input volume on many mics and much hardware is very low, and sometimes people whack the volume up to the max, which is a really bad idea because there is no headroom and clipping just sends a load of resonant spectra into the model, a bit like a distortion pedal.

A lot of problems are down to audio, and we don’t really get much feedback; don’t assume that what PulseAudio or ALSA purports to deliver is actually what arrives at the KWS.
Again, the script above prints the max volume received on a ‘print()’ output.
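
If you just want a standalone sanity check of the capture level before chasing the KWS itself, something like this will do (a sketch; sounddevice is assumed, but any recorder that reports peak sample values works):

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000

# Record a few seconds and report the peak level actually reaching the software.
audio = sd.rec(3 * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()
peak = np.abs(audio).max()
print(f"peak level: {peak:.3f} (1.0 = full scale; near 1.0 means clipping, well under 0.1 means too quiet)")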

AGC is a must. Don’t set your input above ~70%, so you at least have some headroom, and let software or hardware AGC do the rest.
Set up an /etc/asound.conf or ~/.asoundrc:

#pcm default to allow auto software plughw conversion
pcm.!default {
  type asym
  playback.pcm "play"
  capture.pcm "cap"
}

ctl.!default {
  type hw
  card 1
}
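#equalizer plugins (alsaequal, from libasound2-plugin-equal); adjust the bands with: alsamixer -D equal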
ctl.equal {
  type equal;
}
pcm.plugequal {
  type equal;
  slave.pcm "plughw:1,0";
}
pcm.equal {
  type plug;
  slave.pcm plugequal;
}

#pcm is plughw so auto software conversion can take place
#pcm hw: is direct and faster but likely will not support the sampling rate
pcm.play {
  type plug
  slave {
    pcm "plughw:1,0"
  }
}

#pcm is plughw so auto software conversion can take place
#pcm hw: is direct and faster but likely will not support the sampling rate
pcm.cap {
  type plug
  slave {
    pcm "plugequal"
  }
}

pcm.agc {
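 #software AGC via the speex pcm plugin (libasound2-plugins), chained onto the "cap" capture device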
 type speex
 slave.pcm "cap"
 agc on
 agc_level 2000
 denoise off
}


#sudo apt-get install libasound2-plugins
defaults.pcm.rate_converter "speexrate"

Not sure if there is still a mismatch between the versions of libasound2-plugins and speexdsp that forces you to recompile, but if you haven’t got hardware AGC, use the speex plugin.
Also, because I am generally lazy (I will eventually get round to finding a light bandpass filter), I just use the ALSA equalizer. With the above, capture.pcm "cap" gives direct capture; change it to capture.pcm "agc" to get software AGC (see the snippet below).
Often, though, with low input the big problem is false positives, not false negatives.
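
That change is just the one line in the default device, i.e.:

#route default capture through the speex AGC instead of straight to "cap"
pcm.!default {
  type asym
  playback.pcm "play"
  capture.pcm "agc"
}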

I would say, by the look of things, the initial model had a gender bias in its dataset.
Have a play with the above in a venv, as being able to see what is going on might give you some feedback. The wake word is ‘Hey Marvin’.
See how you go, as I feel many training methods are pretty flaky and there are much better methods.

[EDIT] Whoops, I posted the wrong file, but yeah, give the updated one a go, not the previous one :slight_smile:


https://imgur.com/a/fTSA1Tn

I had no idea about this particular open dataset, and I did some thorough searching at one point for such a thing. Sad if the quality is as low as you say it is, though. Silences in files… that’s one of the reasons I stopped using precise-collect.

If you read the names and backers: Mark Mazumder (Harvard University), Yiping Kang (University of Michigan), Juan Ciro (Factored/MLCommons), Keith Achorn (Intel), Daniel Galvez (NVIDIA), Mark Sabini (Landing AI), Peter Mattson (Google), David Kanter (MLCommons), Greg Diamos (Landing AI), Pete Warden (Google), Josh Meyer (Coqui) and more…

Yeah, I have been a bit disappointed, but it’s not a set-in-stone dataset, and it’s a resource of unparalleled scale: the English subset alone has over 38,000 classes.
It’s just that the short words, plus some specific words the forced aligner struggles with, can contain quite a lot of bad samples. If you have already trained a model on them, you should be able to use that model as a filter to clean your dataset.
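
A rough sketch of that filtering idea, reusing the same kind of TFLite scoring as in the earlier script (the model path, dataset layout, feature front-end and the 0.5 cut-off are all my own assumptions, not anything from MLCommons):

import glob
import os
import shutil
import numpy as np
import soundfile as sf
from python_speech_features import mfcc
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="hey_marvin.tflite")  # placeholder: a model already trained on the class
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

os.makedirs("rejected", exist_ok=True)

# Score every clip of the keyword class and quarantine the ones the model rejects.
for path in glob.glob("mswc_wavs/marvin/*.wav"):
    audio, sr = sf.read(path, dtype="float32")
    features = mfcc(audio, samplerate=sr, numcep=13).astype(np.float32).reshape(inp["shape"])
    interpreter.set_tensor(inp["index"], features)
    interpreter.invoke()
    score = interpreter.get_tensor(out["index"]).max()
    if score < 0.5:  # the model itself doesn't believe this clip is the keyword
        shutil.move(path, "rejected/")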

It’s a hugely important initiative for KWS (https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/fe131d7f5a6b38b23cc967316c13dae2-Paper-round2.pdf), yet many who perhaps should be involved are absent…

The point of MLCommons is to provide a single shared dataset, and it needs the open-source community to join in, because the size of the herd is what matters: a little from many can produce massive results.

Pitch and pronunciation are absolutely critical, as they create very different spectra, and a KW is just an MFCC image of a time frame representing that spectra.
I made a dataset builder with an on-screen kiosk that prompts for words to create a custom dataset, using a collection of phonetic pangrams (nonsense sentences that pack as many phones and allophones as possible into a sentence):
GitHub - StuartIanNaylor/Dataset-builder: KWS dataset builder for Google-streaming-kws
It builds great models, but they will only recognise the people who took part in the kiosk data collection.

So even though it’s a bit flaky, MLCommons is invaluable to me because of its wealth of words, and whether it serves as the base or just supplements a custom model, there is no other resource with this quantity of speakers or words.

It does need involvement, though, and all these fragmented, half-hearted collection attempts just need to be relegated to the dust of time.
The Google Speech Commands set is a benchmark dataset and, whether by design or not, it does contain a lot of bad samples; I thought MLCommons would be better, and many of the longer words are, but they are lacking in quantity.
It’s not that bad, but with the big hitters involved I did think it would be better.