Custom wakewords R Us!

baconator · August 13, 2019, 5:20am

Wanting to try a custom wake word but not sure you can get a precise model made? Having problems getting a model built that works as expected? Just need some more data for your wake word?

Precise-Community-Data may be the place for you! It’s a place to build a community-sourced dataset, oriented specifically towards custom wake words.

For a limited time only, and for the simple price of uploading your wakewords, I’m doing automated custom builds of precise wakewords. Get your wake words and some not-wake-words uploaded, and I’ll build models for them over the following week (probably less). They’ll be automated builds based on your wakewords plus all the not-wake-word (nww) data I can use (google words, public noises archive, PCD nww’s, etc). No guarantee on the quality of the end model, but they usually turn out with recognition percentages in the high 90s.

Feel free to post questions here, or upload data at the repo. We’re happy to start accepting more, and hope to build a much larger dataset for everyone to make better models from.

PAQ:

How many words do you need to upload?
At least 20 for each wakeword. The more, the merrier.

Do you need to upload not-wake-words?
Yes…they’re at least as useful as wake words, and if you do targeted nww’s, even more so.

I don’t want to upload my voice since it might get recognized from my account, though.
Then don’t. Or make a secondary git account and upload from it. Or a tertiary one. Lots of ways to obfuscate things if you want. Most people, and so far computers, can’t accurately identify a speaker from a very short sample of speech without additional context.

I don’t have a GitHub account or know how to use git.
You should get one! The general usage isn’t too difficult. On the off chance you have some real issue with this, send me a message over on the chat system.

I don’t want to upload under public domain.
There are a couple of other license types you can explore, but the Creative Commons licenses that allow derivative usage would be best.

gras64 · August 13, 2019, 6:53am

Hello. If anyone is interested, I’m working on a skill for my own wakewords. The goal is to be able to improve the wakeword over time with use and upload automatically. If you would like to help, meet me on chat.

gez-mycroft · August 13, 2019, 7:24am

That’s a very generous offer, thanks baconator!!

Hope everyone makes the most of this while it’s available!

malevolent · August 13, 2019, 7:48am

Great!!!

I guess this only covers English phonemes and so on, doesn’t it?

baconator · August 13, 2019, 8:32am

If they’re in non-en languages, you can upload them and I’ll give it a whirl.

malevolent · August 13, 2019, 10:50am

Great!!

Then I will try this with “computer” and/or “house” in Spanish.
As I’m a complete newbie on this, I have some very basic questions…

· I need to record the wake words in clips; the more recordings, the better.
· Do you recommend any audio program to record the words?
· Following the Athena example, I noticed you recorded the wake word at several distances and with different people. I will try to do the same, without reflecting those differences in the naming convention.
· Once I have the set of recordings in the specified format, I need to fork your repo and make a PR, sending you the wav files in the structure you need and with the naming convention you want.
· I just need to put the wav files in the wake-word/lang-short/ directory, alongside the README.md and Licenses.
· While it seems obvious what to do with the noises folder (I could record a cough, a siren, etc, and then relate each one to a description through the metadata.csv file), I have no idea what to do with the lang-short not-wake-words, as there is no example at the moment. Perhaps say some random words? Record the TV?

baconator · August 13, 2019, 11:06am

Don’t submit copyrighted stuff, please.
For recording, I had a bunch of saved wake words from my picrofts, as well as a few from manually recording. I tended to use arecord -d 3 asdf.wav a lot.

Yes, variations in distance, speaker, speed, and inflection all help improve how robust the model can be.

For athena, some of my nww’s are words like athlete, christina, or gasoline. They were developed as I got false positives from my picroft over a few weeks (and from background tv noise, but I recorded my own words instead). Their metadata.csv is being completed and they will be uploaded soon. See the targeted nww link in the first post for more. NWW’s should also NOT be existing wakewords. If anything, those should be submitted to the relevant directory!

malevolent · August 13, 2019, 3:13pm

Heheh, don’t be afraid, I’m happy to release my voice under any public license. I really love WTFPL, so I guess I will release the recordings under that license.

The only copyrighted stuff could be in the noise part if I have the TV turned on. Would that be ok? If it’s not ok, well, I guess a siren, a truck or a blender sounds the same in your country as in mine.

arecord, you say? Fine, so the command line for Spanish would be exactly this:
arecord -f S16_LE -r 16000 -d3 wakeword-es-$(uuid).wav

On Thursday I’ll take some days out of work and will make some recordings.

baconator · August 13, 2019, 3:33pm

Looks good. On Debian, the package uuid-runtime has uuidgen. A more motivated scripting type would hook all that together for ease of use, haha.
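A rough sketch of how that hooking-together might look, in Python rather than shell so the uuid comes from the standard library. It reuses the arecord flags from the command above (-f S16_LE -r 16000 -d 3) and the wakeword-<lang>-<uuid>.wav file name pattern from the same post. The function names, the "recordings" output directory and the pause between takes are illustrative assumptions, not anything the repo prescribes.

```python
import subprocess
import time
import uuid
from pathlib import Path


def build_arecord_command(out_dir, lang="es", seconds=3):
    """Return (command, path) for one clip, mirroring the arecord call above."""
    # File name pattern follows the example in the thread: wakeword-<lang>-<uuid>.wav
    path = Path(out_dir) / f"wakeword-{lang}-{uuid.uuid4()}.wav"
    command = [
        "arecord",
        "-f", "S16_LE",       # 16-bit little-endian samples
        "-r", "16000",        # 16 kHz sample rate
        "-d", str(seconds),   # stop after this many seconds
        str(path),
    ]
    return command, path


def record_batch(out_dir, count=20, lang="es", seconds=3, pause=1.0):
    """Record `count` clips one after another; needs arecord installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    recorded = []
    for i in range(count):
        command, path = build_arecord_command(out_dir, lang, seconds)
        print(f"[{i + 1}/{count}] say the wake word now -> {path.name}")
        subprocess.run(command, check=True)
        recorded.append(path)
        time.sleep(pause)  # short gap to move around or change inflection
    return recorded


# Example (not run here, it needs a microphone):
# record_batch("recordings", count=20)
```

Varying distance, speed and inflection between takes, as suggested earlier in the thread, is left to whoever is speaking; the script only handles the naming and the loop.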

malevolent · August 15, 2019, 12:48pm

Just PR’ed the wake word.
In the README.md, for the phonemes I wrote
EY OR DE NAA DOR
but I really don’t have a clue how to write it properly. In cmusphinx it is written like
O R D E N A D O R
but that is with the Spanish dict, so… I will leave it to people with more knowledge to transcribe it…

PS: I’ve just realized I need to create the metadata.csv for the not-wake-words. I’m going to do it right now.
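For the metadata.csv step, a minimal sketch in Python. The thread only says the file ties each recording to a description; the two-column filename,description layout and the function name here are assumptions, so check the repo's README for the real header before submitting.

```python
import csv
from pathlib import Path


def write_metadata(clip_dir, descriptions, default="unknown"):
    """Write metadata.csv in clip_dir with one row per .wav file.

    `descriptions` maps file name -> free-text description. The column
    names are a guess for illustration, not the repo's documented schema.
    """
    clip_dir = Path(clip_dir)
    rows = [
        (wav.name, descriptions.get(wav.name, default))
        for wav in sorted(clip_dir.glob("*.wav"))
    ]
    with open(clip_dir / "metadata.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "description"])  # assumed header
        writer.writerows(rows)
    return rows
```

Any clip without an entry in the mapping gets the default description, which makes missing entries easy to spot before opening the PR.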

baconator · August 15, 2019, 6:19pm

Ok, PR with 2 models pending (precise .2 and .3). I used a partial subset of the nww data I have, since there’s not as comparable a dataset yet, but still got to high .99s in training. Should be merged by tomorrow at the latest.

malevolent · August 20, 2019, 11:34pm

So… any luck with the Spanish wake word @baconator?

baconator · August 21, 2019, 3:22am

There’s a pending PR with the models if you go peruse the repo!

baconator · September 5, 2019, 4:38am

PR no longer pending! Sorry about the delay. Hopefully you were able to pull it down before now.

malevolent · September 5, 2019, 8:55am

I don’t get it… pull down what?

baconator · September 5, 2019, 2:34pm

The formerly pending PR. It has been merged now, so it is part of the main repo.

malevolent · September 5, 2019, 8:15pm

hoorah!!!

jodynickel · October 15, 2019, 6:19pm

Is this still available? I’ve been trying to develop my own custom wakeword using the precise toolchain but am getting a series of exceptions and am blocked from moving forward. I’ve posted the issue to the forum but haven’t seen any responses as of yet. I can record and train with the precise tool chain, but precise-listen and precise-convert fail.

How much training data is needed? I have about 30 wakeword recordings, and 10 or so non-wakeword recordings. I can certainly make more if that will significantly improve the accuracy.

JarbasAl · October 15, 2019, 7:41pm

If you submit your samples to the repo, @baconator will probably train a model for you.

Data is a big issue: 30 are not enough, you will need more recordings.

EDIT: 30 might work ok, at least for your own voice, but I would say to use at least 50 /EDIT

For not-wake-words, getting rhyming words etc. will probably also help.

Another thing to keep in mind: if your samples are there, more people are likely to submit their own samples for your word, so even if you do not have enough data it is a good idea to submit it.
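Before submitting, it can help to count the clips and confirm they match the recording format used earlier in the thread (16-bit samples at 16 kHz). Below is a sketch using Python's standard wave module. The thresholds come from numbers mentioned in this thread (20 as the minimum from the first post, 50 as the suggestion above); the function names and the single-channel expectation (arecord's default) are assumptions.

```python
import wave
from pathlib import Path


def clip_problems(path, rate=16000, sample_width=2, channels=1):
    """Return a list of format mismatches for one WAV file (empty if fine)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate {wav.getframerate()} != {rate}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"sample width {wav.getsampwidth()} bytes != {sample_width}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels != {channels}")
    return problems


def summarize(clip_dir, minimum=20, suggested=50):
    """Count .wav clips in a directory and collect any format problems."""
    clips = sorted(Path(clip_dir).glob("*.wav"))
    bad = {}
    for clip in clips:
        problems = clip_problems(clip)
        if problems:
            bad[clip.name] = problems
    return {
        "count": len(clips),
        "meets_minimum": len(clips) >= minimum,
        "meets_suggested": len(clips) >= suggested,
        "bad": bad,
    }
```

Calling summarize on a folder of wake word clips returns the count, whether it reaches each threshold, and a per-file list of mismatches to re-record or convert.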

baconator · October 15, 2019, 7:54pm

yes.

And if you upload a bunch I’ll probably add 20 wake and a few not-wake-words as well.