The premier open example of what you are talking about is surely Common Voice (https://voice.mozilla.org/), which is collecting validated, open-licensed recordings for DeepSpeech (or any other engine). Users can record text snippets and vote on other users' recordings. Two positive votes with no negative vote means the recording is included in the dataset.
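As a rough sketch, that acceptance rule boils down to something like this (my reading of the process as described above, not Common Voice's actual implementation):

```python
# Sketch of the Common Voice acceptance rule as described above:
# two positive votes and no negative vote means the clip is included.
def is_validated(up_votes: int, down_votes: int) -> bool:
    return up_votes >= 2 and down_votes == 0
```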
Vital for speech recognition, where a huge amount of data is required for good results. The Common Voice team suggest 10,000 hours of validated audio as a target.
With 1500 hours, the current English language model struggles for accuracy with names and less common words.
Possibly less so for a hotword engine, where a small amount of data can be used to train a model for a single speaker, although a general hotword model still takes a lot of data.
I see NLU as a different beast. My understanding is that an NLU model needs hand-crafting to avoid overlap and to balance accuracy against features, and can only be pushed so wide before the user experience suffers. Domain specific. Less is more. KISS.
Large data sets still help.
Audio to text is a one-to-one mapping. Converting a sentence into meaning is one-to-many, depending on context. “Tell me more”… about what?
Feeding context along with the dictated text into an NLU model can help with overlap.
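As a toy illustration of that one-to-many problem, here is how session context could fill in the missing meaning. The parse() stub and all names below are invented for illustration, not any particular framework's API:

```python
# A hypothetical sketch of resolving "tell me more" with session context.
def parse(utterance: str) -> dict:
    # Stand-in NLU: "tell me more" carries an intent but no topic entity.
    if utterance == "tell me more":
        return {"intent": "tell_me_more", "entities": {}}
    return {"intent": "unknown", "entities": {}}

def resolve(utterance: str, session: dict) -> dict:
    result = parse(utterance)
    # One utterance, many meanings: fill the missing topic from context.
    if result["intent"] == "tell_me_more" and "topic" not in result["entities"]:
        result["entities"]["topic"] = session.get("last_topic")
    return result

session = {"last_topic": "the news story about Mars"}
print(resolve("tell me more", session))
# {'intent': 'tell_me_more', 'entities': {'topic': 'the news story about Mars'}}
```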
Google and Alexa impose context switching by requiring a skill name after the hotword. That's OK for voice apps that engage continuously, but “OK Google, ask meeka music to …” for music control is a pain.
RASA uses a secondary machine-learning model, trained on example stories of sequential NLU intents and actions, to select between scored possibilities from the NLU model.
Session context is also used in routing stories and decisions. A “stop” intent is interpreted by the news reader rather than the music player because the session remembers it was the last skill used in the dialog history.
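For illustration, this is roughly what those stories look like in RASA's markdown story format (intent and action names here are made up). The same “stop” intent resolves differently depending on what came before:

```
## listen to news, then stop
* play_news
  - action_play_news
* stop
  - action_stop_news

## play music, then stop
* play_music
  - action_play_music
* stop
  - action_stop_music
```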
Mycroft and JOVO support explicitly switching intents on and off using a state variable held by the skill server session. I recall Dialogflow had something similar.
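In Mycroft that looks roughly like the sketch below; enable_intent()/disable_intent() are real MycroftSkill methods, but the skill and intent names are hypothetical:

```python
# Hypothetical quiz skill: the answer intent is only active while a
# question is pending, so it can't collide with other skills' intents.
from mycroft import MycroftSkill, intent_handler

class BrainBenderSkill(MycroftSkill):
    def initialize(self):
        # Off by default; switched on only inside a quiz dialog.
        self.disable_intent('answer.intent')

    @intent_handler('question.intent')
    def handle_question(self, message):
        self.speak('What is the capital of France?', expect_response=True)
        self.enable_intent('answer.intent')

    @intent_handler('answer.intent')
    def handle_answer(self, message):
        self.speak('Nice try!')
        self.disable_intent('answer.intent')


def create_skill():
    return BrainBenderSkill()
```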
There is also potential context from:
- speaker identification
- location tracking
- my cloud data, e.g. recent searches, messages, and contacts
- a range of hotwords
- my social network
Maybe I'm being pessimistic about the constraints of NLU engines, but I think Mycroft integrators will find they have to be careful with which skills they combine.
I’ve certainly seen plenty of applications start to fall over as I load up the plugins.
Your goal is a broad ecosystem of skills, which will inevitably overlap. Curation/certification would help. Skill naming like Google/Alexa is a possibility (or multiple-hotword switching).
Session-inclusive context like RASA's, combined with other technologies like speaker identification, would be helpful.
Lots of examples is probably the best medicine.
I think your suggestion of an open-licensed repository of sentence/context-to-meaning maps would be a great resource for the voice developer community to source initial training data for their application domain.
A challenge is finding a format and tools so that central data can be converted to various NLU training data formats.
It would need to allow for context filters to be attached to training data.
Categorised by application. Music. Search. News. Weather. Joe’s BrainBender Skill.
Curation to flag overlap using additional context filters, on the expectation that users will select their own training data as needed.
Include some tools to assist with selecting from the data set, merging training examples with entity data using Chatito, and converting to various destination NLU training formats.
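For reference, Chatito's DSL generates examples by combining intent templates with alias and slot lists, roughly like this (the intents and values below are invented):

```
%[play_music]
    ~[play] @[artist]
    ~[play] some @[genre] music

~[play]
    play
    put on

@[artist]
    the beatles
    miles davis

@[genre]
    jazz
    rock
```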
From another angle, categorised lists are very useful for generating training data. A model can get by on few intent examples, but good entity recognition requires significantly more data. Lists of animals, fruits, people's names, movie stars, or code for scraping that data would be very useful for improving intent recognition in a restricted selection of domains. RhymeZone shows how WordNet can be used to find related words based on frequency of closeness in an English corpus.
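A crude sketch of that list-driven expansion, assuming a plain-text artists.txt with one value per line (the file name and templates are invented):

```python
# Expand a plain-text entity list into labelled training examples.
import itertools
import json

templates = [
    "play some music by {artist}",
    "put on a song by {artist}",
]

with open("artists.txt") as f:  # one entity value per line
    artists = [line.strip() for line in f if line.strip()]

examples = [
    {
        "text": template.format(artist=artist),
        "intent": "play_music",
        "entities": [{"entity": "artist", "value": artist}],
    }
    for template, artist in itertools.product(templates, artists)
]

print(json.dumps(examples[:2], indent=2))
```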
The RASA team have a GitHub repository of shared NLU training data: https://github.com/RasaHQ/NLU-training-data.
Scraping the Mycroft community skills repository would certainly yield a nice chunk of NLU training data.
Google put up lots of public domain examples for Dialogflow.
There’d be heaps of starter material.
There are also lots of folks out there who have shared thoughts and code related to voice development in blogs and repositories.
Tools to encourage capture and collation from live systems, like RASA-X, generate lots of data that can be particularly valuable in providing sentence structures for intents that may not have been considered when generating the initial data.
A website with search tools would be nice, but a GitHub repo would do. Something like:
- README with useful links
- Chatito to expand entities
- library of spoken responses in ?? format
  - many files in JSON format that specify example, NLU parse, related contexts ?? (see the sketch below)
  - allow for entity expansion from scraped sources
- Joe's BrainBender Skill
  - README with links to open data sources
  - plain-text lists of entity values
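To make the JSON idea concrete, here is one hypothetical shape for a shared record with a context filter attached, and a converter to RASA's markdown NLU format as one possible destination. The record schema is invented; only the RASA output format is real:

```python
# Hypothetical shared-record schema with a context filter attached,
# converted to RASA markdown NLU format as one destination example.
from collections import defaultdict

records = [
    {
        "text": "stop the music",
        "intent": "stop",
        "entities": [{"entity": "domain", "value": "music",
                      "start": 9, "end": 14}],
        "context": {"category": "Music", "last_skill": "music_player"},
    },
]

def to_rasa_markdown(records):
    by_intent = defaultdict(list)
    for record in records:
        text = record["text"]
        # Inline entity markup, RASA-markdown style: [value](entity)
        for ent in sorted(record["entities"],
                          key=lambda e: e["start"], reverse=True):
            text = (text[:ent["start"]]
                    + "[" + text[ent["start"]:ent["end"]] + "]"
                    + "(" + ent["entity"] + ")"
                    + text[ent["end"]:])
        by_intent[record["intent"]].append(text)
    lines = []
    for intent, examples in by_intent.items():
        lines.append("## intent:" + intent)
        lines.extend("- " + example for example in examples)
    return "\n".join(lines)

print(to_rasa_markdown(records))
# ## intent:stop
# - stop the [music](domain)
```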
I’d have some time to feed an NLU sources repository if anyone else is interested in owning and promoting it.