Thanks @JarbasAl . I’ve read el tocino’s guide and I have sunk a bunch of time into my training data already. I guess I’ll check those out if I just can’t get training to work with my existing data.
Ready-made KW & !KW sets, 10k of each, then the same sets again but with 100k of each for good measure.
Formatted and just needing training; the 1st or 2nd pair is your choice.
All wavs are from MLCommons; if they don't train and behave as last time, then it's your setup, not your dataset.
If they do train, then it's your dataset, and you should get an idea of how a reasonable set should train.
They are not Big-Data good, but they're as good a ready-made binary dataset (minus noise) as you will find.
Found one problem: the wav files must all be in the top-level directories (wake-word, not-wake-word, test/wake-word, test/not-wake-word) during training. I must have missed that in one of the guides!
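If anyone else hits the same layout problem, a quick sanity check can flag wavs that have drifted out of the four top-level folders before you kick off training. This is a hypothetical helper, not part of Precise itself; the folder names follow the ones mentioned above:

```python
import os

# Folders Precise training expects wavs to live in directly (assumption
# based on the layout described above, not taken from Precise docs).
EXPECTED_DIRS = {
    "wake-word",
    "not-wake-word",
    os.path.join("test", "wake-word"),
    os.path.join("test", "not-wake-word"),
}

def misplaced_wavs(root):
    """Return wav paths (relative to root) outside the expected top-level dirs."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        for name in filenames:
            if name.lower().endswith(".wav") and rel not in EXPECTED_DIRS:
                bad.append(os.path.join(rel, name))
    return bad
```

Run it against your dataset root and fix anything it lists before training.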
This seems like a way better starting point for my training / data cleanup:
precise-listen doesn’t work yet (it doesn’t seem to recognize when I say the wake word).
Update: adding all the speech_commands_v0.01 helped a bit… I am at “99.9%” with fewer false positives and fewer false negatives. precise-listen works sometimes! I’m a little suspicious of this USB mic through a VM setup, so I’m going to try this fresh model on my actual picroft.
You probably need to turn on precise-collect so you can play back the KW audio it received, which is a bit hard when it isn't receiving any, I guess.
Maybe it has options to report the max amplitude and average level of the incoming wav; dunno, as I'm not a Precise fan and don't use it.
PS: stop using Audacity and manual editing; pip install sox and check the pysox 1.4.2 documentation ("Welcome to pysox's documentation!"), as your latest results are not good.
Maybe being forced to do it programmatically, making sure length, format, sample rate and levels are right and not full of distortion, may highlight possible problems more?
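As a sketch of what "checking programmatically" can look like without any extra dependencies, the stdlib wave module can verify format, sample rate and length (pysox offers richer checks via sox.file_info). The thresholds here are assumptions, not anything Precise mandates:

```python
import wave

def check_wav(path, sr=16000, channels=1, sampwidth=2, max_seconds=3.0):
    """Return a list of problems with a training wav (empty list = OK).

    Checks the things that commonly break training data: sample rate,
    channel count, bit depth (sampwidth 2 = 16-bit) and clip length.
    The defaults are assumed values, adjust them to your dataset spec.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != sr:
            problems.append(f"sample rate {w.getframerate()} != {sr}")
        if w.getnchannels() != channels:
            problems.append(f"{w.getnchannels()} channels, expected {channels}")
        if w.getsampwidth() != sampwidth:
            problems.append(f"sample width {w.getsampwidth()}, expected {sampwidth}")
        duration = w.getnframes() / w.getframerate()
        if duration > max_seconds:
            problems.append(f"duration {duration:.2f}s > {max_seconds}s")
    return problems
```

Loop it over every file in the dataset and you get a cleanup worklist instead of eyeballing clips in Audacity.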
You can always connect to your VM and run arecord -D plughw:1 -V mono -r 16000 -f S16_LE -c 1 /dev/null (or whatever device index is your mic); the -V will display a VU meter on the CLI. It doesn't record anything, it's just a level check.
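If you'd rather inspect a captured wav than watch the VU meter, peak and RMS levels answer the same question (ties in with the max amplitude / average mentioned above). A minimal stdlib sketch for 16-bit mono files, with rule-of-thumb thresholds that are my assumptions:

```python
import array
import math
import wave

def wav_levels(path):
    """Return (peak, rms) of a 16-bit mono wav, normalised to 0.0-1.0.

    Rough reading (assumed rules of thumb): peak at or near 1.0 means
    clipping; rms below ~0.01 usually means the wrong capture device,
    a muted mic, or the gain turned way down.
    """
    with wave.open(path, "rb") as w:
        samples = array.array("h", w.readframes(w.getnframes()))
    if not samples:
        return 0.0, 0.0
    peak = max(abs(s) for s in samples) / 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return peak, rms
```

Record a few seconds with arecord to a file instead of /dev/null, run this on it, and you know exactly what signal Precise is being fed.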
PS VM or container?
Use docker ps to get the name of the existing container
Why Mycroft doesn't have a high-quality preformatted dataset available is a mystery to me, but hey-marvin sufficed and at least served its purpose.
Maybe start again, but switch and make a hey-marvin model, adding more samples from family members.
Also, the text sentences to grab !KW from are as concise as you can get, as they are 'phonetic pangrams': nonsense sentences that pack just about every phone & allophone into a single sentence.
=== Summary ===
299 out of 319
93.73%
3.23% false positives
12.75% false negatives
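For what it's worth, those summary percentages are mutually consistent with a test split of roughly 102 wake-word and 217 not-wake-word clips. Those counts are reverse-engineered assumptions (the thread never states the split), but they show how the numbers relate:

```python
# Assumed counts that reproduce the reported summary; the actual test
# split was not stated, these are just numbers that fit the percentages.
wake_total, notwake_total = 102, 217   # test clips per class (assumption)
false_neg, false_pos = 13, 7           # missed KWs, wrong triggers

total = wake_total + notwake_total
correct = total - false_neg - false_pos

print(f"{correct} out of {total}")                                # 299 out of 319
print(f"{100 * correct / total:.2f}%")                            # 93.73%
print(f"{100 * false_pos / notwake_total:.2f}% false positives")  # 3.23%
print(f"{100 * false_neg / wake_total:.2f}% false negatives")     # 12.75%
```

Note the false-positive and false-negative rates are each relative to their own class, not to the 319 total, which is why they don't sum to the overall error rate.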
Probably me being a KWS snob, but for me the above results are atrocious; even on the dataset I posted I would still be drilling down and cleaning at 99.88%.
The KW set was the only remnant of a hard-drive mishap (formatting the wrong one and losing loads of work); I would have made you a much cleaner !KW set than just grabbing a selection from MLCommons, but hey. Likely on my own models with my own dataset I would get 100%, which actually means very little, as you are feeding it what it was just trained on and it should be near 100% since it has already heard those samples; in a real-life environment, with conditions and input it hasn't been trained on, things will only get worse.
Still, I guess it doesn't matter: with a binary model such as Precise, when you ran the Hey-Marvin KW & !KW it was all just voice samples, and the cross entropy was quite high, so the input had to be quite close to the KW label to trigger.
Because any single label has no memory of individual elements, just a function of its overall input, adding non-voice (!voice) samples will lower the cross entropy and push the boundary further away from the KW, making it less accurate.
It's the catch-22 of a binary model like this: trying to place so much variance in one label and so little variance in the other.
So if you are happy with that, go with the flow and just find out what signal Precise is getting from your mic; if that is good enough for you, it's good enough for you.
Sparky recommends recording via Audacity, but to be honest, when you are hand-manipulating that quantity of files you are bound to get something wrong, and that is where sox comes in for post-processing and augmentation.
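To illustrate the kind of post-processing meant, here is a toy stdlib sketch of one augmentation pass (gain change plus white noise) on a 16-bit mono wav. sox/pysox do this and far more, properly; the defaults here are arbitrary assumptions, and this is not a substitute for a real augmentation pipeline:

```python
import array
import random
import wave

def augment_gain_noise(src, dst, gain=0.8, noise_amp=300):
    """Toy augmentation: scale a 16-bit mono wav and add white noise.

    gain and noise_amp are hypothetical defaults for illustration;
    samples are clamped to the valid 16-bit range after processing.
    """
    with wave.open(src, "rb") as w:
        params = w.getparams()
        samples = array.array("h", w.readframes(w.getnframes()))
    out = array.array("h", (
        max(-32768, min(32767,
            int(s * gain) + random.randint(-noise_amp, noise_amp)))
        for s in samples))
    with wave.open(dst, "wb") as w:
        w.setparams(params)
        w.writeframes(out.tobytes())
```

Applied over a whole KW folder with a few different gain/noise settings, one clean recording session multiplies into many varied training samples, which is exactly the batch work that goes wrong when done by hand.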
I am still kicking myself for losing all my scripts, as it's somehow more painful to develop them a second time even though I can remember the gist.
I am making a model of my own in the next couple of days and will be starting on those scripts again, so I will share them when done; I've been dodging it with a sulk.