I’m running the latest PiCroft on a Google AIY Voice Kit V1. It’s working great so far and I love it. That said, it’s not perfect. I searched and couldn’t find this issue addressed, though there are some closely related topics already.
One of the issues I’ve noticed is that if I’m listening to a podcast or the TV is playing, it catches my wake word just fine, but then it seems to have trouble parsing my actual request out of the background noise. I assume this is because “Hey Mycroft” has a dedicated ML model behind it, making the software much better at filtering for the wake word than for the user’s request. So I got to thinking about a solution. I hope you’ll forgive my ignorance on some of these subjects; maybe folks are already working on this, or perhaps it’s already happening but constraints keep it from being particularly effective.
I know training for a particular voice is a feature of most other assistants, which, unless I’ve missed it, Mycroft doesn’t currently offer. While that would serve as a solution, it creates a new problem: other people couldn’t readily interact with Mycroft.
So what then? I’ve used software like Audacity that lets me take a sample of noise and use that clip as a profile to remove the noise from the rest of a recording. Something like this, but in reverse, might be a solution for Mycroft: record the wake word, and use some algorithm to build a sort of filter, along the lines of the Audacity noise profile but inverted, that keeps only the voice that spoke the wake word and filters out everything else. I grant there may well be hardware, computational, or time limitations that would prevent this from working, but I don’t have the depth of knowledge on this subject to know. Maybe this isn’t doable with the resources on a Pi?
Or perhaps some hybrid: sample the background noise just before the wake word, sample the wake word itself, and use the two together to filter out everything except the voice that made the request. A rough sketch of what I mean follows.
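To make this concrete, here’s a rough sketch of classic spectral subtraction, which is (as far as I understand) roughly what Audacity’s noise reduction does under the hood. Everything here is invented for illustration: the file names, the frame size, and the 1.5 oversubtraction factor aren’t from Mycroft or any real pipeline.

```python
# Sketch of spectral subtraction: estimate the noise spectrum from
# audio captured just before the wake word, then subtract it from the
# spectrum of the user's request. Purely illustrative; file names and
# constants are made up.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft

FRAME = 512  # STFT window size in samples (arbitrary choice)

# Hypothetical clips: background noise grabbed before the wake word,
# and the noisy request recorded after it.
rate, noise = wavfile.read("pre_wakeword_noise.wav")
_, request = wavfile.read("noisy_request.wav")
noise = noise.astype(np.float64)
request = request.astype(np.float64)

# Average magnitude spectrum of the noise-only clip: the "profile".
_, _, noise_spec = stft(noise, fs=rate, nperseg=FRAME)
noise_profile = np.abs(noise_spec).mean(axis=1, keepdims=True)

# Subtract the profile from every frame of the request, keeping phase.
_, _, req_spec = stft(request, fs=rate, nperseg=FRAME)
magnitude = np.abs(req_spec)
phase = np.angle(req_spec)
cleaned = np.maximum(magnitude - 1.5 * noise_profile, 0.0)

# Back to the time domain for a listening test.
_, out = istft(cleaned * np.exp(1j * phase), fs=rate, nperseg=FRAME)
wavfile.write("cleaned_request.wav", rate, out.astype(np.int16))
```

Something like this is just a couple of FFTs, so I’d guess even a Pi could keep up. The catch, as I understand it, is that it only removes steady noise like a fan or a hum, not another voice coming from the TV.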
This wouldn’t even need to happen in real time every time. I’d imagine that after a few wake word samples from the same user you could build a fast, local model of the user’s voice, which could then be used to quickly separate that voice from background noise. I assume this is what other assistants that support voice training do proactively during the training process. A sketch of the kind of thing I’m imagining follows.
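And here’s an equally made-up sketch of what I mean by a “model of the user’s voice”: average the MFCC features of a few saved wake word clips into a crude voice print, then keep only the frames of the request audio that resemble it. Real assistants presumably use trained speaker-embedding models rather than raw MFCCs; every file name and threshold here is invented.

```python
# Crude "voice print": average MFCCs over saved wake word clips, then
# mask out STFT frames of the request that don't match the print.
# Illustrative only; a real system would use learned speaker embeddings.
import numpy as np
import librosa

def voice_print(paths, sr=16000, n_mfcc=20):
    """Average MFCC vector over several wake word recordings."""
    frames = []
    for path in paths:
        y, _ = librosa.load(path, sr=sr)
        frames.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc))
    return np.concatenate(frames, axis=1).mean(axis=1)

def keep_matching_frames(y, print_vec, sr=16000, threshold=0.6):
    """Zero out STFT frames whose MFCCs are unlike the voice print."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=print_vec.shape[0])
    # Cosine similarity between each frame's MFCCs and the print.
    sims = (mfcc.T @ print_vec) / (
        np.linalg.norm(mfcc, axis=0) * np.linalg.norm(print_vec) + 1e-9
    )
    spec = librosa.stft(y)
    # Default hop lengths match, so MFCC and STFT frames line up.
    mask = (sims > threshold).astype(float)[np.newaxis, : spec.shape[1]]
    return librosa.istft(spec * mask)

# Hypothetical usage: three stored wake word clips, one noisy request.
vp = voice_print(["wake1.wav", "wake2.wav", "wake3.wav"])
request, _ = librosa.load("request.wav", sr=16000)
cleaned = keep_matching_frames(request, vp)
```

Cosine similarity over raw MFCCs is a blunt instrument, and frame-level masking would mangle audio wherever voices overlap, but it hopefully shows why a handful of wake word samples might be enough to bootstrap something local and fast.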
Is this even feasible? It would be nice not to have to mute whatever I’m listening to just to interact with Mycroft reliably.
I know I might be asking for currently infeasible technical miracles, and if so, I’m sorry. I very much appreciate all the hard work by the Mycroft team and the community. Thank you!