Mycroft Community Forum

Different behaviour of the Keras / TensorFlow model (precise-listen vs. live)

Hey mycroftees,

I've (incrementally) trained a Keras model on the mycroft-precise dev branch with quite an extensive load of material, ending up at 97% accuracy.

Tested it with precise-listen and got the expected outcome: it dings only on "samira" (the wake word) or "samira"-ish words, essentially with 100% accuracy. I also tested it against TV audio and other common noises around here. Everything as expected so far.
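
For reference, the train/test loop was essentially the standard one from the precise README; roughly this (samira-data/ is just a placeholder for my dataset folder with wake-word/, not-wake-word/ and test/ subfolders):

precise-train samira.net samira-data/      # train the Keras (.net) model
precise-test samira.net samira-data/       # accuracy statistics on the test set
precise-listen samira.net                  # quick sanity check against the microphone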

But once that model is actually deployed in Mycroft, things turn somewhat upside-down. Recognition of "samira" drops to roughly 10%, while the church bell triggers it with incredible reliability.

I triple-checked the config, which is in line with the examples given in the docs. The wake word is set to "samira" (confirmed in mycroft_cli_client), and from what I can tell voice.log isn't indicating any major problems:

2020-07-25 17:34:01.048 | INFO     |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:323 | Creating wake word engine
2020-07-25 17:34:01.050 | INFO     |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:346 | Using hotword entry for samira
2020-07-25 17:34:01.052 | WARNING  |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:348 | Phonemes are missing falling back to listeners configuration
2020-07-25 17:34:01.054 | WARNING  |   706 | mycroft.client.speech.listener:create_wake_word_recognizer:352 | Threshold is missing falling back to listeners configuration
2020-07-25 17:34:01.060 | INFO     |   706 | mycroft.client.speech.hotword_factory:load_module:403 | Loading "samira" wake word via precise
2020-07-25 17:34:03.368 | INFO     |   706 | mycroft.client.speech.listener:create_wakeup_recognizer:360 | creating stand up word engine
2020-07-25 17:34:03.371 | INFO     |   706 | mycroft.client.speech.hotword_factory:load_module:403 | Loading "wake up" wake word via pocketsphinx
2020-07-25 17:34:03.672 | INFO     |   706 | mycroft.messagebus.client.client:on_open:114 | Connected
2020-07-25 17:35:51.805 | INFO     |   706 | mycroft.session:get:74 | New Session Start: b49e73cc-c657-4611-8934-7a049a81546c

A couple of questions regarding that log:
Why is a pocketsphinx wake word loaded (since none is set up in the conf)? Is that a fallback?
And why are phonemes and threshold reported as missing? (pocketsphinx stuff?)
(edit: samira.pb.params: "threshold_config": [[6, 4]], "threshold_center": 0.2)
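
For completeness, the relevant part of my mycroft.conf is basically just the module and the model path; roughly this (the path is a placeholder, and there are indeed no phoneme/threshold keys in there, which I assume is what those two warnings refer to):

{
  "listener": {
    "wake_word": "samira"
  },
  "hotwords": {
    "samira": {
      "module": "precise",
      "local_model_file": "/home/pi/.mycroft/precise/samira.pb"
    }
  }
}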

What could have gone so wrong that the live implementation is that inaccurate? Is there a known problem with dev? Should I step back to master?

How much data? How much of it is wake word vs. not wake word? 97% means there are some misses on your dataset; did you reinforce those items? Do the noises that are triggering the tf model live also trigger it in testing?

I know there's some room for improvement in that regard, and I think I know how to solve that.

The problem here is that the behaviour of the (live) TensorFlow model is completely off compared to the accuracy precise-listen indicates.

How are you testing each?

OK, I tested the .net with precise-listen (statistics from precise-test) and the .pb only under live conditions. Now that you've asked, I tested the .pb with precise-listen as well, and it behaves similarly.
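
Concretely, the comparison looked roughly like this (precise-convert being how the .pb was produced in the first place, if I remember the tooling correctly):

precise-test samira.net samira-data/     # statistics on the test set
precise-listen samira.net                # Keras model against the microphone
precise-convert samira.net               # writes samira.pb plus samira.pb.params
precise-listen samira.pb                 # converted model against the microphone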

It would be nice to know where the loading/implementation problem stems from.

Most likely it's a sensitivity issue. I just forgot to add the custom sensitivity (0.8) to the hotword entry.

Are you testing on the same host you’re using it live on?

Yes, to reduce uncertainty. And to be frank, there is plenty of that for those who are not knee-deep into this.

The model is trained at 0.8 sensitivity; I just forgot to address that. But in that department I still have to figure out how the threshold alters the overall sensitivity of the listener.
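
For anyone following along, the amended hotword entry now looks something like this (trigger_level left at what I believe is the default of 3; as far as I can tell both sensitivity and trigger_level are read by the precise hotword plugin):

{
  "hotwords": {
    "samira": {
      "module": "precise",
      "local_model_file": "/home/pi/.mycroft/precise/samira.pb",
      "sensitivity": 0.8,
      "trigger_level": 3
    }
  }
}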

And then it's back to the drawing board to fill those gaps of (live) false positives. Unfortunately the German database is not nearly as good, and despite running the model against multiple gigabytes of phonemes, Common Voice data and noises there are huge gaps; logically speaking, those should have absolutely nothing to do with any sensitivity setting. The statistical number (even one as high as 99%) is only as good as the data thrown at it.
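
The plan for those gaps is the usual reinforcement loop, assuming precise-train-incremental still works the way I remember, i.e. it runs the model over its folder of long random/noise recordings, keeps whatever falsely activates it as not-wake-word data and retrains:

precise-train-incremental samira.net samira-data/   # retrain on collected false activations
precise-test samira.net samira-data/                # re-check the statistics afterwards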

To be clear, this is far from being resolved, even with the sensitivity added in afterwards. I made another model, switched to it and switched back, and experienced either a lot of silence (i.e. not recognizing the wake word) or triggering on planes, fans, etc., whereas mycroft-precise (precise-listen) is pretty much on point.

That sounds more like data is the issue.

What version of precise are you training on?