Different behaviour Keras / Tensorflow Model (precise-listen vs. live)

Hey mycroftees,

I’ve (incrementally) trained a Keras model on the mycroft-precise dev version with quite an extensive load of material, resulting in 97% accuracy.

Tested it with precise-listen with the expected outcome: it only dings on “samira” (the wake word) or “samira”-ish words, with 100% accuracy. Tested it against TV and other common noises around here. Everything as expected so far.
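(That test is simply the trained Keras model run against the microphone, i.e. roughly:)

    precise-listen samira.net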

But once that model is deployed in Mycroft, things turn somewhat upside-down: recognition of “samira” drops to about 10%, and the church bell triggers it with incredible accuracy.

Triple-checked the config, which is in line with the one given in the docs. The wake word is set to “samira” (mycroft_cli_client), and from what I can tell voice.log doesn’t indicate any major problems.

2020-07-25 17:34:01.048 | INFO     |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:323 | Creating wake word engine
2020-07-25 17:34:01.050 | INFO     |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:346 | Using hotword entry for samira
2020-07-25 17:34:01.052 | WARNING  |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:348 | Phonemes are missing falling back to listeners configuration
2020-07-25 17:34:01.054 | WARNING  |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:352 | Threshold is missing falling back to listeners configuration
2020-07-25 17:34:01.060 | INFO     |   706 | mycroft.client.speech.hotword_factory:load_module:403 | Loading "samira" wake word via precise
2020-07-25 17:34:03.368 | INFO     |   706 | mycroft.client.speech.listener:create_wakeup_recognizer:360 | creating stand up word engine
2020-07-25 17:34:03.371 | INFO     |   706 | mycroft.client.speech.hotword_factory:load_module:403 | Loading "wake up" wake word via pocketsphinx
2020-07-25 17:34:03.672 | INFO     |   706 | mycroft.messagebus.client.client:on_open:114 | Connected
2020-07-25 17:35:51.805 | INFO     |   706 | mycroft.session:get:74 | New Session Start: b49e73cc-c657-4611-8934-7a049a81546c

A couple of questions regarding that log:
Why is a pocketsphinx wake word loaded (since none is set up in the conf)? A fallback?
And why are phonemes and thresholds missing? (pocketsphinx stuff?)
(edit: samira.pb.params: "threshold_config": [[6, 4]], "threshold_center": 0.2)
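For context, the hotword entry in my mycroft.conf looks roughly like this (path illustrative); as far as I can tell, phonemes and threshold are pocketsphinx parameters, which would explain those fallback warnings:

    "hotwords": {
      "samira": {
        "module": "precise",
        "local_model_file": "/home/pi/.mycroft/precise/samira.pb"
      }
    },
    "listener": {
      "wake_word": "samira"
    }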

What has gone so wrong that the live implementation is that inaccurate? Is there a known problem with dev? Should I step back to master?

How much data? How much of it is wake word vs. not-wake-word? 97% means there are some misses on your dataset; did you reinforce those items? Do the noises that are triggering the TF model trigger it in testing as well?

I know there’s some room for improvement in that regard, and I think I know how to solve that.

The problem here is that the (live) TensorFlow model’s behaviour is completely off compared to the accuracy precise-listen indicates.

How are you testing each?

OK, I tested the .net with precise-listen (statistics from precise-test); the .pb only in live conditions. Now that you’ve asked, I tested the .pb with precise-listen as well, and it behaves similarly.

Would be nice to know where the loading/implementation problem stems from.
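For reference, the conversion and testing steps were roughly (file names are mine):

    precise-convert samira.net          # Keras .net -> TensorFlow .pb (+ .pb.params)
    precise-listen samira.net           # live test of the Keras model
    precise-listen samira.pb            # live test of the converted model
    precise-test samira.net samira/     # statistics against the test set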

Most likely it’s a sensitivity issue. Just forgot to add the custom (.8) sensitivity to the hotword entry.
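i.e. roughly this in the hotword block (value from my training run, path illustrative):

    "samira": {
      "module": "precise",
      "local_model_file": "/home/pi/.mycroft/precise/samira.pb",
      "sensitivity": 0.8
    }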

Are you testing on the same host you’re using it live on?

Yes, to reduce uncertainty. And there is plenty of that for those who are not knee-deep into this, to be frank.

The model is trained at .8 sensitivity; I just forgot to address that. But in that department I still have to figure out how the threshold alters the overall sensitivity of the listener.
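From what I can tell, the sensitivity and the trigger threshold meet in precise_runner, which mycroft-core wraps; a minimal sketch along the lines of the mycroft-precise README (paths and values are illustrative):

    from threading import Event
    from precise_runner import PreciseEngine, PreciseRunner

    # precise-engine binary plus the converted .pb model; both paths illustrative
    engine = PreciseEngine('/home/pi/.mycroft/precise/precise-engine/precise-engine',
                           '/home/pi/.mycroft/precise/samira.pb')

    # sensitivity biases the raw network output, trigger_level is how many
    # successive activated chunks are needed before the wake word fires
    runner = PreciseRunner(engine, sensitivity=0.8, trigger_level=3,
                           on_activation=lambda: print('wake word!'))
    runner.start()
    Event().wait()  # keep listening until killed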

And then it’s back to the drawing board to fill those gaps of (live) false positives. Unfortunately the German database is not nearly as good, and despite running against multiple gigs of phonemes, Common Voice data and noises there are huge gaps - and logically speaking those should have absolutely nothing to do with any sensitivity. The statistical number (even as high as 99%) is only as good as the data thrown at it.

To be clear, this is far from resolved, even with the sensitivity added subsequently. Made another model, switched it in and switched back, and experienced a lot of silence (i.e. not recognizing the wake word), or triggering on planes, ventilators etc., whereas mycroft-precise (precise-listen) is pretty much on point.

That sounds more like data is the issue.

What version of precise are you training on?

Has anyone else gotten around this yet? I went through the steps in the Mycroft documentation. The freshly trained wake word did work, but also tripped on almost every noise it heard, as is to be expected. Training it against the public domain sounds to cut out the false positives yielded a wake word that did not work at all. I created a new wake word again, from scratch, and followed the tips section to fully train it (e 300 b 5000 s .8). After training it against both the public domain sounds and the Google sounds, I have a 99.84% model on precise-test, and it works flawlessly in precise-listen. However, I have not yet been able to get it to trigger once I run Mycroft with the .pb file. If I change the file name in mycroft.conf to the mycroft.pb file, the mycroft wake word works. I think I am seeing the same thing as SGee.
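For reference, the sequence from the tutorial boils down to roughly this (folder names are mine, flags as above):

    # initial training on my own recordings
    precise-train -e 300 -b 5000 -s 0.8 samira.net samira/
    # reinforce against the public domain / Google sounds dropped into data/random
    precise-train-incremental samira.net samira/
    # statistics, then convert the Keras model to a .pb for mycroft-core
    precise-test samira.net samira/
    precise-convert samira.net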

What version of precise are you training on?

0.3.0, cloned from GitHub

See if you can check out the 0.2.0 version and train with that instead. That’s what mycroft-core is still using at the moment.

Thanks. I did try to check out the 0.2.0 version, but it fails due to tensorflow 1.8.0 no longer being available. I downloaded the new precise-engine to .mycroft/precise/, renamed the old precise-engine directory, and unzipped the updated engine there, and that did correct the issue.
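Roughly what that looked like (archive name illustrative; I took the engine build matching my 0.3.0 install):

    cd ~/.mycroft/precise
    mv precise-engine precise-engine-old       # keep the shipped engine around
    tar xzf ~/Downloads/precise-engine.tar.gz  # unpack the newer engine in its place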


You can use TensorFlow 1.13 on 0.2 without issue.

Mycroft could probably do with something like the Linto HMG (Hotword Model Generator).

Creating models is very much about the quality of the dataset, and after playing with it I came to really like the layout of the HMG.

After you create your model you can run tests and validation; the false positives and negatives are listed, and you can click to play each wav sample.
That instantly answers why a sample failed, and often it’s extremely apparent.

Until that point I had presumed the Google command set was pretty good, but with HMG I ended up listening to almost 8% that were just extremely bad trims, where most of the hotword is clipped.
I thought the Google command dataset was validated, but wow, that’s not true.

It’s an old adage with computers, but especially true with datasets: garbage in can drastically affect the output.
Non-hotwords have a lesser effect, but you’d be surprised how something that sounds like crap can have similar phones.

After deleting a shedload of hotwords and some non-hotwords that kept appearing, I massively increased the model accuracy.
The number of samples even for HWS/KWS can be quite a lot to manage, and a desktop tool like HMG can be a godsend.

Don’t assume your dataset is good, and if you find bad samples, definitely remove them from your dataset.

I found HMG quite interesting for showing how completely wrong samples can heavily weight accuracy the wrong way.

Something like HMG would be a great plus for Mycroft, as would a user-based web interface to share distributed datasets with some peer-review method.
There are loads of ASR datasets, but word datasets are a rarer beast, and all we need to do is share Mydrive/Googledrive datasets in a distributed database that can be queried by language, region, gender, age and word(s).

As discussed in the Dev Syncs, there should soon be an update of the Selene UI with such functionality - though the scope is unclear. Therefore I’ve stopped bothering with the custom WW for now and am taking a closer look at the (German) skill implementation.

But the HMG is a good call; I’ll test it in my Arch VM, since that’s the only instance running TF 2 (there is a pre-dev mycroft-precise PR regarding tf>2.0, btw).

The _val percentages are dubious at best and only as good as your test samples in each category, whose quality is hard to determine. If trained incrementally, the split between not-wake-word and test/not-wake-word is made by ratio, not quality - most of the time lumping cascaded snippets in one place.
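(For reference, the folder layout precise expects for that split is roughly the following; as far as I can tell the test/ folders feed the _val numbers:)

    samira/
    ├── wake-word/            # recordings of the wake word
    ├── not-wake-word/        # anything that is not the wake word
    └── test/
        ├── wake-word/        # held-out wake word samples
        └── not-wake-word/    # held-out negatives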

Since you followed the same tutorial: trim the “silence” before the WW with sox (it’s a great source of false-positive environmental sounds).
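e.g. something like this per sample (duration/threshold values are just a starting point):

    # strip the leading "silence"/room noise before the wake word
    sox samira-001.wav trimmed/samira-001.wav silence 1 0.1 1%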

That said, my best results were with 0.6 sensitivity in training and 0.55 set in the conf… (Oh, and the save-best -sb flag during training.)


It would be great if a dev could clarify the upcoming Selene UI precise functionality. Also, I would love to hear from @Wolfgange about the matter discussed.