I’m currently training my own custom wake word (WW) and I’m facing some issues:
1- Words other than the wake word are being recognized.
First, I trained the WW model using only the Public Domain Sounds Backup as negative examples (100 files) and 12 files containing the wake word as positive examples. This led the model to trigger on almost any sound containing a voice, because the PDSB set is a collection of noises, not speech. Adding 18 more positive examples made recognition slightly better, and adding 25 negative examples of words that sound similar to my WW improved it much more, but the model still recognizes many words that are not the intended WW.
Since the model is slowly getting better, I'm guessing it's just a matter of more training data. I'd like to know whether I'm on the right path, or whether training should work much better with less data and I'm simply training it wrong.
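One cheap way to stretch a small dataset like this is to augment it: mix each positive clip with background noise at several signal-to-noise ratios, so the model sees the wake word under more conditions without new recordings. Below is a minimal numpy sketch of that idea; the function name `mix_at_snr` and the synthetic signals are my own, not part of any training tool.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at a target SNR in dB.

    Both inputs are 1-D float arrays of samples; the noise is looped
    or trimmed to match the speech length before scaling.
    """
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / noise_power hits the target SNR.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

# Example: generate three noisy variants of one (here synthetic) positive clip.
rng = np.random.default_rng(0)
clip = np.sin(np.linspace(0, 200 * np.pi, 16000))   # stand-in for a WW recording
background = rng.standard_normal(8000)               # stand-in for a PDSB noise file
variants = [mix_at_snr(clip, background, snr) for snr in (20.0, 10.0, 5.0)]
```

Each variant can then be written out as a new positive training file; the lower-SNR copies also make the model less eager to fire on noise alone.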
2- The model only recognizes the WW when it is pronounced very close to the mic.
I tried different mics, and the 'hey mycroft' WW works really well even from afar. My model only works (if at all) when the person is within about 0.5 m of the mic. When I recorded the audio samples, I asked people to pronounce the word about 5 cm from the mic. Could this be the reason? Is there a way to change the sensitivity of the model in the code that trains it? I've read the code but couldn't find a parameter for that. Maybe I missed it?
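Recording everything at 5 cm likely does matter: far-field speech is quieter and sits in more room noise, so a model trained only on close-mic audio never sees that condition. One workaround, short of re-recording, is to simulate distance by attenuating each positive clip and adding low-level noise. This is a rough numpy sketch under my own assumptions (the gain and SNR values, and the name `simulate_far_field`, are illustrative, not taken from any trainer):

```python
import numpy as np

def simulate_far_field(clip, noise, speech_gain_db=-12.0, snr_db=15.0):
    """Roughly simulate a distant speaker: quieter speech plus room noise.

    clip and noise are 1-D float sample arrays; returns a new array of
    the same length as clip.
    """
    # Attenuate the speech as if the speaker were farther from the mic.
    attenuated = clip * 10 ** (speech_gain_db / 20)
    # Loop/trim the noise to the clip length, then scale it to the target SNR.
    noise = np.resize(noise, clip.shape)
    target_noise_power = np.mean(attenuated ** 2) / 10 ** (snr_db / 10)
    noise = noise * np.sqrt(target_noise_power / np.mean(noise ** 2))
    return attenuated + noise

# Example with synthetic stand-ins for a close-mic recording and room noise.
rng = np.random.default_rng(1)
close_mic = np.sin(np.linspace(0, 200 * np.pi, 16000))
room_noise = rng.standard_normal(4000)
far_version = simulate_far_field(close_mic, room_noise)
```

A real room also adds reverberation, which simple attenuation does not capture, so a handful of genuinely distant recordings would still help.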
It is my first time training an ML model, so any further information about this process is highly appreciated.