For Those Having Trouble Training a New Wake Word

All the sample files have a .wav extension.

I’ve never seen riff come up in format discussions before, so I’m sort of singling that out as sus, as the kiddos say.

I used ffmpeg in the same Ubuntu 18.04 VM I’ve been using for training and testing. :man_shrugging:
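For what it's worth, RIFF is just the container format that WAV files use, so seeing it reported is normal. A quick way to rule out format problems is to read each file's header. The sketch below uses only Python's standard `wave` module and assumes the target is 16 kHz, mono, 16-bit PCM (the same settings as the arecord command later in this thread); `EXPECTED`, `wav_format` and `find_bad_wavs` are names made up for illustration, not part of Precise.

```python
import sys
import wave
from pathlib import Path

# Assumed target format: 16 kHz, mono, 16-bit PCM. Adjust if your setup differs.
EXPECTED = {"channels": 1, "sample_rate": 16000, "sample_width_bytes": 2}


def wav_format(path):
    """Return the basic format fields from a WAV file's header."""
    with wave.open(str(path), "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_rate": wf.getframerate(),
            "sample_width_bytes": wf.getsampwidth(),
        }


def find_bad_wavs(root):
    """Yield (path, problem) for each .wav under root that is unreadable or off-format."""
    for path in sorted(Path(root).rglob("*.wav")):
        try:
            fmt = wav_format(path)
        except (wave.Error, EOFError) as exc:
            # Not a valid RIFF/WAVE file, or truncated/empty.
            yield path, f"unreadable: {exc}"
            continue
        if fmt != EXPECTED:
            yield path, f"unexpected format: {fmt}"


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, problem in find_bad_wavs(root):
        print(f"{path}: {problem}")
```

Run it against the dataset folder; no output means every file matched the assumed format.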

You’re welcome to upload some of your data somewhere and I can try to look at it when I get a chance.

Oh sure, awesome! Ok, I’ll message you a link.

Edit: I’m just a hobbyist too, don’t worry. If I can get it, you can get it. There’s just something goofy going on here.

Thanks, I needed this!

You can try these hey-marvin 10k KW
https://drive.google.com/file/d/1v8xfbj7bILcHn8KkTSs6G-twd3glnsbo/view?usp=sharing

hey-marvin2 100k KW
https://drive.google.com/file/d/1lm-wKZAYIGKQxtm2t6Vkrlv935ccJJfP/view?usp=sharing

Might not be the KW you want but gives you a datum to get used to training.

notkw-10k
https://drive.google.com/file/d/1V1rMgZKTUItZbzat_ZfDArg493kCkL0B/view?usp=sharing

notkw 100k
https://drive.google.com/file/d/1rxdzAS1t49tYvv8GkdwOzvamhdu3XMIC/view?usp=sharing

util-scripts
https://drive.google.com/file/d/18DMNdrwBUXw3lgDysoarfgw94rublGGx/view?usp=sharing

https://storage.googleapis.com/public-datasets-mswc/audio/en.tar.gz

For starters you should read the precise tips from eltocino, one of the first guides that has been used and validated by the community over the years.

Recently we got these cool repos. I haven't tried them yet, but I know of people having great results with them:

and

What are these? What are you suggesting? Sorry, I’m totally lost.

Thanks @JarbasAl . I’ve read el tocino’s guide and I have sunk a bunch of time into my training data already. I guess I’ll check those out if I just can’t get training to work with my existing data.

Ready-made KW & !KW sets, 10k of each, then the same sets but 100k of each for good measure.
They are formatted and just need training; whether you use the 1st or 2nd pair is your choice.
All wavs are from MLCommons. If they don’t train and behave as last time, then it’s your setup, not your dataset.
If they do train, then it’s your dataset, and you should get an idea of how a reasonable set should train.
They are not Big Data good, but they are as good a ready-made binary dataset, minus noise, as you will find.

I used these. I put 80% in wake-word and 20% in test/wake-word.

I used these and did a similar 80/20 split between not-wake-word and test/not-wake-word.
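The 80/20 split described above can be sketched roughly like this. It is a minimal example that copies a shuffled 20% of the files into the test folder and the rest into the training folder; the function name, the `seed` parameter and the source path in the usage note are hypothetical.

```python
import random
import shutil
from pathlib import Path


def split_samples(src_dir, train_dir, test_dir, test_fraction=0.2, seed=0):
    """Copy .wav files from src_dir into train_dir and test_dir.

    Roughly test_fraction of the files go to test_dir, the rest to train_dir.
    Returns (number_of_train_files, number_of_test_files).
    """
    files = sorted(Path(src_dir).glob("*.wav"))
    # Fixed seed so the same split can be reproduced later.
    random.Random(seed).shuffle(files)
    n_test = round(len(files) * test_fraction)

    train_dir, test_dir = Path(train_dir), Path(test_dir)
    train_dir.mkdir(parents=True, exist_ok=True)
    test_dir.mkdir(parents=True, exist_ok=True)

    for i, f in enumerate(files):
        dest = test_dir if i < n_test else train_dir
        shutil.copy2(f, dest / f.name)
    return len(files) - n_test, n_test
```

For example, `split_samples("downloads/hey-marvin-kw", "hey-marvin/wake-word", "hey-marvin/test/wake-word")` for the wake word, and the same call with the not-wake-word folders for the other set.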

Worked. Perfectly. Dang! I guess the problem is with my data. Thank you! Back to the drawing board.

$ precise-test hey-marvin.net hey-marvin/ 2>/dev/null 
Loading wake-word...
Loading not-wake-word...
Data: <TrainData wake_words=8000 not_wake_words=9108 test_wake_words=2000 test_not_wake_words=2277>
=== False Positives ===
hey-marvin/test/not-wake-word/2819015076embroideredcommon_voice_en_18751486.wav
hey-marvin/test/not-wake-word/2827423258theobaldcommon_voice_en_18984515.wav
hey-marvin/test/not-wake-word/2606411025indianapoliscommon_voice_en_19116959.wav

=== False Negatives ===
hey-marvin/test/wake-word/c913d758-e6d2-4a35-b930-e0ebd11656f1cats.wav
hey-marvin/test/wake-word/d367a887-994e-449d-987e-22b36a5099dbcatstb.wav

=== Counts ===
False Positives: 3
True Negatives: 2274
False Negatives: 2
True Positives: 1998


=== Summary ===
4272 out of 4277
99.88%

0.13% false positives
0.10% false negatives

Found one problem: the wav files must all be in the top-level directories (wake-word, not-wake-word, test/wake-word, test/not-wake-word) during training. I must have missed that in one of the guides!
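If the samples are currently sorted into subfolders, something like the following could pull them up into the four top-level directories. This is only a sketch: the directory names come from the layout listed above, while `flatten_dir`, `flatten_dataset` and the underscore-joined renaming scheme are assumptions made here to avoid filename clashes.

```python
import shutil
from pathlib import Path

# The four directories that are expected to hold the wav files directly.
CLASS_DIRS = ("wake-word", "not-wake-word", "test/wake-word", "test/not-wake-word")


def flatten_dir(class_dir):
    """Move every nested .wav up into class_dir; return how many were moved."""
    class_dir = Path(class_dir)
    moved = 0
    # sorted() materialises the listing before any file is moved.
    for path in sorted(class_dir.rglob("*.wav")):
        if path.parent == class_dir:
            continue  # already at the top level
        # Prefix the name with its subfolder path, e.g. a/b/x.wav -> a_b_x.wav.
        target = class_dir / "_".join(path.relative_to(class_dir).parts)
        if target.exists():
            raise FileExistsError(target)
        shutil.move(str(path), str(target))
        moved += 1
    return moved


def flatten_dataset(root):
    """Flatten each class directory that exists under root."""
    root = Path(root)
    return {
        name: flatten_dir(root / name)
        for name in CLASS_DIRS
        if (root / name).is_dir()
    }
```

The emptied subfolders are left in place; they can be removed by hand once the result looks right.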

This seems like a way better starting point for my training / data cleanup:

$ precise-test custom-hey-mycroft.net custom-hey-mycroft/ 2>/dev/null 
Loading wake-word...
Loading not-wake-word...
Data: <TrainData wake_words=421 not_wake_words=885 test_wake_words=102 test_not_wake_words=217>
=== False Positives ===
custom-hey-mycroft/test/not-wake-word/room-noise-03-46.wav
custom-hey-mycroft/test/not-wake-word/a-01_S-25.wav
custom-hey-mycroft/test/not-wake-word/e-01_S-4.wav
custom-hey-mycroft/test/not-wake-word/room-noise-03-20.wav
custom-hey-mycroft/test/not-wake-word/m-01_S-12.wav
custom-hey-mycroft/test/not-wake-word/m-01_S-47.wav
custom-hey-mycroft/test/not-wake-word/room-noise-03-48.wav

=== False Negatives ===
custom-hey-mycroft/test/wake-word/a-02_S-11.wav
custom-hey-mycroft/test/wake-word/a-01_S.wav
custom-hey-mycroft/test/wake-word/d-02_S-23.wav
custom-hey-mycroft/test/wake-word/a-02_S-10.wav
custom-hey-mycroft/test/wake-word/a-02_S-36.wav
custom-hey-mycroft/test/wake-word/m-01_S-37.wav
custom-hey-mycroft/test/wake-word/a-02_S-26.wav
custom-hey-mycroft/test/wake-word/a-01_S-11.wav
custom-hey-mycroft/test/wake-word/m-02_S-10.wav
custom-hey-mycroft/test/wake-word/a-01_S-54.wav
custom-hey-mycroft/test/wake-word/m-02_S-4.wav
custom-hey-mycroft/test/wake-word/m-02_S-12.wav
custom-hey-mycroft/test/wake-word/a-01_S-4.wav

=== Counts ===
False Positives: 7
True Negatives: 210
False Negatives: 13
True Positives: 89


=== Summary ===
299 out of 319
93.73%

3.23% false positives
12.75% false negatives
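For anyone wondering how the summary lines relate to the counts: the figures are consistent with the false-positive rate being taken over all not-wake-word test files and the false-negative rate over all wake-word test files. The snippet below is just that arithmetic, not code taken from Precise, and `summarize` is a made-up name.

```python
def summarize(tp, tn, fp, fn):
    """Turn the four confusion counts into the three summary percentages."""
    total = tp + tn + fp + fn
    return {
        "accuracy_pct": 100 * (tp + tn) / total,
        "false_positive_pct": 100 * fp / (fp + tn),  # share of negatives wrongly accepted
        "false_negative_pct": 100 * fn / (fn + tp),  # share of wake words missed
    }


stats = summarize(tp=89, tn=210, fp=7, fn=13)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
# accuracy_pct: 93.73
# false_positive_pct: 3.23
# false_negative_pct: 12.75
```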

precise-listen doesn’t work yet (it doesn’t seem to recognize when I say the wake word).

Update: adding all the speech_commands_v0.01 helped a bit… I am at “99.9%” with fewer false positives and fewer false negatives. precise-listen works sometimes! I’m a little suspicious of this USB mic through a VM setup, so I’m going to try this fresh model on my actual picroft.

You prob need to turn on Precise-collect so you can play back the KW it received which is a bit hard when not receiving any I guess.
Maybe it does have options to give max amplitude and avg of incoming wav, dunno as not a precise fan and don’t use it.

PS stop using audacity and manual editing: pip install sox and check the pysox documentation (pysox 1.4.2), as your latest results are not good.
Maybe being forced to do it programmatically, making sure length, format, sample rate and levels are right and not full of distortion, may highlight possible problems more?
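As a rough illustration of checking length and levels programmatically, here is a sketch that uses only the Python standard library rather than pysox, so it runs without sox installed. It assumes 16-bit PCM input; `audit_wav` and the "clipped" rule (any sample at full scale) are choices made for this example.

```python
import array
import math
import sys
import wave


def audit_wav(path):
    """Return duration, sample rate and peak level for a 16-bit PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit samples")
        rate = wf.getframerate()
        n_frames = wf.getnframes()
        # All channels are read together; the peak is taken over every sample.
        samples = array.array("h", wf.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    peak = max((abs(s) for s in samples), default=0)
    # dBFS: 0 is full scale, more negative is quieter, -inf is pure silence.
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")
    return {
        "duration_s": n_frames / rate,
        "sample_rate": rate,
        "peak_dbfs": peak_dbfs,
        "clipped": peak >= 32767,
    }
```

Looping this over a dataset and sorting by `duration_s` or `peak_dbfs` tends to surface the odd files quickly: very quiet, clipped, or much longer or shorter than the rest.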

You can always connect to your VM and run:
$ arecord -D plughw:1 -V mono -r 16000 -f S16_LE -c 1 /dev/null
Use whatever device index is your mic; the -V will display a VU meter on the CLI. It doesn’t record anything but is just a level check.

PS VM or container?

Why Mycroft doesn’t have a high-quality preformatted dataset available, a gap that hey-marvin filled here, is a mystery to me, but at least it served its purpose.
Maybe start again but switch and make a hey-marvin by adding more samples from family members.

I did a rough and ready recording ‘boutique’ a while back: StuartIanNaylor/Dataset-builder on GitHub (a KWS dataset builder for Google-streaming-kws).
It prompts on screen, and the code there for augmentation and stuff will give you a load of tips.

Also just the text sentences to grab !kw from are as concise as you can get as they are ‘phonetic pangrams’ of nonsense sentences that have just about every phone & allophone in a single sentence.

Not good how?

VM. I followed sparky-vision/mycroft-precise-tips on GitHub, which recommends audacity over sox.

Seriously? Argggghhhh.

=== Summary ===
299 out of 319
93.73%

3.23% false positives
12.75% false negatives

Prob me being a KWS snob but for me the above results are atrocious, even on the dataset I posted I would still be drilling down and cleaning when 99.88%.
The KW where the only remnants a harddrive mishap of formatting the wrong one and losing loads of work as would of had a much cleaner !kw than just grabbing a selection fro MLCommons for you, but hey. Likely on my own models with my own dataset I would get 100% which actually means very little as your feeding what its just been trained on and its should be near 100% as it should of heard those already as in a real life environment with conditions and input it hasn’t been trained on things will just get worse.

Still guess it doesn’t matter as with a binary model such as Precise when you ran Hey-Marvin KW & !KW it was just voice samples and the cross entropy was quite high so the input KW had to be quite close to the KW label.
Because any single label has no memory of any individual elements just a graph of its input adding !voice will lower the cross entropy and move further away from KW so making it less accurate.
It’s the catch-22 of a binary model such as this: trying to place so much variance in a single label and so little variance in another.

So if you are happy with that, go with the flow and just find out what signal Precise is getting from your mic; if that is good enough for you, it’s good enough for you.
Sparky recommends recording via Audacity, but to be honest, when you are hand-manipulating that quantity of files you are just bound to get something wrong, and that is where sox comes in for post-processing and augmentation.
I am still kicking myself for losing all my scripts, as it’s somehow more painful to dev them a 2nd time even though I can remember the gist.
I am making a model of my own in the next couple of days and will be starting on those scripts again, so I will share them when they are done; I’ve been dodging it with a sulk.

Just adding those bits as I go along. Try the trim script from ProjectEars/dataset (main branch of StuartIanNaylor/ProjectEars on GitHub) as a test on your KW dataset.
Also, by the time you look there will hopefully be an augmentation folder as well.