If you redo the training for Hey Mycroft with your own voice, does it improve recognition?
Yeah, we’re working on a system that lets any user easily train their local wake word model on their own voice.
If you only care about a small number of users and can get samples to train from each of them, the resulting model will probably be terrible at picking up others, but great at handling that handful of people.
I was just wondering about the ‘Hey Mycroft’ dataset (or any dataset): if you train with your own voice, what sort of improvement does it give, Gez?
I was looking at the 8/4 split of 12 recordings and wondering if that was all that is needed.
But yeah, great that you’re working on it. I was also wondering how much the sample quantity adds to the accuracy.
I was just wondering if Mycroft should record near-threshold misses and have a routine where you yea or nay them on playback; Mycroft could then use them to grow the “is wake word” and “not wake word” collections and maybe train a new model overnight with the additions.
I presume what everybody does is record wake words in a perfectly silent setting and then have Mycroft working in an environment where there is a lot of noise from various devices.
I’m currently doing a lot of work on this. The data pipeline for “Hey, Mycroft” got borked when we moved to Selene and we haven’t been able to find time to resolve the issues.
To be honest, I don’t know if a dozen samples from the individual will improve the training or not. One of the tasks on my agenda right now is building metrics around Precise and then comparing them to off-the-shelf systems like Picovoice, Snowboy, etc.
During training we do insert noise into the samples. We also collected data from a wide range of noise environments on a wide range of microphones.
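That kind of noise injection can be sketched in a few lines. This is a generic illustration of mixing a noise clip into a speech clip at a chosen signal-to-noise ratio, not Precise’s actual augmentation code:

```python
import numpy as np

def mix_noise(speech, noise, snr_db=10.0):
    """Mix a noise clip into a speech clip at a target SNR (in dB).

    Both inputs are equal-length float arrays with values in [-1, 1].
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return speech
    # Scale noise so speech_power / (scaled noise power) hits the target SNR
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    # Renormalize only if the sum clips; this preserves the SNR
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Synthetic stand-ins for a 1-second recording at 16 kHz
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 100, 16000))   # fake "speech"
noise = rng.normal(0, 0.1, 16000)             # fake background noise
noisy = mix_noise(speech, noise, snr_db=10.0)
```

In practice you would draw the noise clips from recordings of real rooms, fans, TVs, etc., and randomize the SNR per training sample.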
Once I get a handle on the data pipeline, installation and some of the “magic numbers” in the code I’ll have a better idea where we can make improvements.
I suspect that there is a way to pre-train the models on the generic dataset, then improve the model by incorporating data from the individual using the current device, but I’m not quite there yet.
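The pre-train-then-personalize idea can be shown with a toy model. Below, a tiny logistic regression stands in for the real wake word network, all data is synthetic, and the function names are mine, not Precise’s: train on a large “generic” set, then fine-tune the same weights gently on a dozen samples from one user.

```python
import numpy as np

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Tiny logistic-regression trainer; stands in for the wake word network."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))        # predicted wake-word probability
        w -= lr * X.T @ (p - y) / len(y)    # gradient step on log loss
    return w

rng = np.random.default_rng(1)

# "Generic" dataset: many speakers
X_gen = rng.normal(0, 1, (500, 8))
y_gen = (X_gen[:, 0] + X_gen[:, 1] > 0).astype(float)
w = train_logreg(X_gen, y_gen)              # pre-train on generic data

# "Individual" dataset: a dozen samples from one user, slightly shifted
X_usr = rng.normal(0.3, 1, (12, 8))
y_usr = (X_usr[:, 0] + X_usr[:, 1] > 0.3).astype(float)
w = train_logreg(X_usr, y_usr, w=w, lr=0.01, epochs=50)  # gentle fine-tune
```

The key knobs are the ones reused here: start from the pre-trained weights rather than from scratch, and use a much smaller learning rate and fewer epochs so a dozen samples nudge the model toward one voice without wiping out what it learned from everyone else.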
I’ll have better answers in a week or so (I hope), barring distractions.
https://github.com/FunctionLab/selene
With the neural networks and ML I have tried, I don’t know if it’s just me, but there seems to be a disconnect between the examples and what you end up trying to implement yourself.
The fine-tuning and hyperparameters chosen by the original developers seem to be an arcane art rather than a programmatic process, which has me holding my hands up in surrender.
I have noticed with “Hey Mycroft” that a few times I have actually forgotten the “Hey” and been surprised that “Mycroft” alone could actually trigger.
From playing with Google and Amazon, I think two-word wake words tend to have more human error involved.
“MycroftAI” might well be a better wake word than “Hey Mycroft”, as us dumb humans can often pause far too long between words.
With “Hey” there is a tendency to put a longer pause inside the wake word than between the wake word and the intent.
I presume samples from the actual control voice do improve training, but noise from the control voice’s environment would as well.
I presume 12 is a ridiculously low number of samples in terms of training?
Hence why I wondered if Mycroft could store actual usage and build up a much larger training set without an initial laborious, long-winded training session.
Maybe play back a near-miss session and separate the clips into good/bad sets with a yea/nay reply to each playback.
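The yea/nay triage described above could look something like this. Everything here is hypothetical: the directory names, the assumption that near-threshold clips get saved somewhere, and the `triage` routine itself are mine, not an existing Mycroft feature:

```python
import shutil
from pathlib import Path

# Hypothetical layout: clips saved when the score was near the threshold,
# plus the two labeled collections the user sorts them into.
NEAR_MISSES = Path("near-misses")
WAKE_DIR = Path("wake-word")
NOT_WAKE_DIR = Path("not-wake-word")

def triage(play=lambda clip: print(f"(playing {clip})"), ask=input):
    """Play back each near-miss clip and file it by a yes/no answer.

    `play` and `ask` are injectable so real audio playback and a real
    prompt (voice or GUI) can be swapped in.
    """
    WAKE_DIR.mkdir(exist_ok=True)
    NOT_WAKE_DIR.mkdir(exist_ok=True)
    for clip in sorted(NEAR_MISSES.glob("*.wav")):
        play(clip)
        answer = ask(f"Was '{clip.name}' the wake word? [y/n] ").strip().lower()
        dest = WAKE_DIR if answer.startswith("y") else NOT_WAKE_DIR
        shutil.move(str(clip), dest / clip.name)
```

An overnight job could then retrain on the two growing directories, so the model keeps improving from real usage in the real room rather than from one silent recording session.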