Mycroft Community Forum

Hermod voice suite

Hi All,
Excited to see the discussion around an open hardware design.
I recall a comment about not being a hardware company but that is the thing I want most from Mycroft.

I’ve been developing an open source voice dialog service suite https://github.com/syntithenai/hermod inspired by Snips and integrating RASA. I put up a demo (https://edison.syntithenai.com) using voice to help fill crosswords.

My suite is designed for network distribution and web integration but can also be used voice-only standalone, similar to Mycroft.
I’d started tinkering with this way back when I signed up as a Mycroft backer and even then saw Mycroft as an open hardware solution.
A pi4 with usb speakers and playstation mic array works OK but ‘barge in’ doesn’t work and it’s ugly.

As a backer I will completely understand if I am never sent hardware. I see that Mycroft as an organisation has done much good in creating a software stack and a vision of an open source platform for developing voice applications where you can make choices about how you compromise your privacy.

Snips offered a similar vision with offline recognition and a developer-friendly API, but closed source, and was ultimately sold to Sonos.
Picovoice is doing some great work with WebAssembly solutions, with ASR and NLU in the browser, but again closed source and with no local training of models.

Hopefully there are options for cross-pollination. I’ve certainly gained context from your source tree.
I couldn’t get past the possibilities of network distribution using the MQTT bus.
Last I checked you’re not using streaming speech recognition which is in my mind essential for responsiveness.

Thanks for your great work. Keep the open source light shining.

cheers

Steve

Looks like there are options for streaming STTs already: https://github.com/MycroftAI/mycroft-core/blob/6f33cc0553235df483dab109778798f0e2e9fbdc/mycroft/stt/__init__.py#L352

Using DeepSpeech (not streaming) I have unnoticeable latency as well.

my ignorance, ta for the link
S

About one year ago I got DeepSpeechStreamServerSTT working with an older DS version (0.5 iirc). Besides proving that it works, my takeaway was that latency did not improve significantly. As far as I understand, inference performance has improved with the current DS version. With a beefier machine running the DS server the latency will probably drop… (…note to myself: add DeepSpeech to my todo list again…)

(I am still using Google STT, as models for my native language - German - are not feasible for everyday use, WER >15%)

1 Like

GPU definitely helps after loading.

?

If you don’t stream then you have a latency of at least the wav duration.
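A back-of-envelope sketch of that claim (illustrative numbers only, not benchmarks; `rtf` is the decoder's real-time factor):

```python
# Back-of-envelope model of why non-streaming STT feels slow.
# Numbers here are illustrative, not measurements.

def batch_delay(utterance_s, rtf):
    # Decoding can only start once the whole wav exists, so after the
    # user stops speaking they wait roughly rtf * utterance duration.
    return rtf * utterance_s

def streaming_delay(chunk_s, rtf, finalise_s):
    # Decoding overlaps with speech; after the user stops, only the final
    # audio chunk and the decoder finalisation step remain.
    return rtf * chunk_s + finalise_s

# A 5 s utterance at real-time decoding speed: ~5 s perceived wait for
# batch vs. well under a second for streaming.
print(batch_delay(5.0, 1.0), streaming_delay(0.25, 1.0, 0.2))
```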

DeepSpeech is a streaming model, though.

Thanks for the words of encouragement. It is an interesting project.

Quick question, how are you adding dialog? We’ve been giving some thought to piping missed queries ( from opt-in community members only ) to a queue and encouraging the community to tag them as a group effort.

We’re glad to have you as a backer and are looking forward to seeing what you come up with.

1 Like

Hi Joshua,
I’m not clear what you mean by adding dialog. I’ll take a punt and guess you mean dialog training data/example sentences for the NLU engine.

The RASA philosophy seems to be that real input from end users is the gold.
I’m taking a varied approach using a combination of

  • initial hand carved RASA training data
chatito to build RASA training data, especially for integrating large numbers of entities.
  • capturing all end user NLU requests as RASA training data to database.
  • providing a tool to end users that lets them fix incorrect NLU results by highlighting section of the last transcript text and assigning intent/entities.
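The chatito step above boils down to expanding sentence templates against entity value lists. A minimal sketch of the idea in plain Python (the templates and values here are invented for illustration):

```python
from itertools import product

# Chatito-style expansion: combine sentence templates with entity value
# lists to mass-produce NLU training examples.

templates = [
    "what is a synonym for {word}",
    "give me another word for {word}",
]
entity_values = ["happy", "fast", "large"]

training_examples = [
    template.format(word=value)
    for template, value in product(templates, entity_values)
]
print(len(training_examples))  # 6 generated sentences
```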

I’m still coming to terms with the art of building an NLU/dialog model with RASA.

  • Less is more. Fewer intents means less possibility of conflict between intent examples. RASA offers some great tools for validating and finding conflicts in models.
Integrating many independent plugins/vocabulary packages is problematic because of the potential for uncoordinated overlap. Google/Alexa have the annoying plugin invocation prefix: “Hey Alexa, ask meeka music to play some pop”.
  • Be generous with entity examples. You don’t have to get all possible values but hundreds are required to do a good job of picking up entities that are not in the training data.
  • Transcription accuracy particularly of uncommon words used in entities can vary based on the ASR engine being used. Action handlers can assist by using fuzzy matching of the value provided from the transcription to a legal set of values.
Allow for fallback in action handlers. For example, in my crossword model, asking for an attribute of a thing without providing a value for that thing results in falling back to a general search rather than one based on Wikidata. Even though the intent was incorrect, there is still a useful outcome.
With RASA NLU, entity matches (and corresponding values in session) are considered when selecting an intent match. Some intent examples may only include the entity. This increases the prospect of overlap but is helpful to user engagement if used cautiously.
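The fuzzy-matching point above can be sketched with the standard library's difflib; the set of legal values here is invented for illustration:

```python
from difflib import get_close_matches

# Sketch of fuzzy-matching a transcribed entity value against the set of
# legal values an action handler knows about.
LEGAL_ARTISTS = ["Huey Lewis and the News", "Fleetwood Mac", "Blondie"]

def resolve_entity(transcribed, legal, cutoff=0.6):
    # ASR may mangle uncommon words; snap to the closest known value,
    # or return None so the handler can fall back to a general search.
    lowered = {value.lower(): value for value in legal}
    matches = get_close_matches(transcribed.lower(), list(lowered),
                                n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(resolve_entity("hughie lewis and the news", LEGAL_ARTISTS))
```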

In general, don’t go too general. Pick a small set of critical intents to support your application and expand cautiously from there.

I’d like to feed all of the intents from opt-in members to an engine where they can then be marked for the appropriate skill. I’d like to give the community access to the engine so that members of the community can come in and mark intents.

For example: Play “Huey Lewis and the News” should play music by the band, not trigger the news skill.

There are dozens and dozens of examples like this that we need to figure out how to deal with. In some cases ambiguity needs to trigger a clarification question from the AI.
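One crude way to resolve the “Huey Lewis and the News” case is to consult a music gazetteer before letting the news skill claim the word “news”, and ask a clarification question when neither side wins. A sketch, with invented skill names and gazetteer:

```python
# Gazetteer-first routing sketch: known artist names beat keyword matches,
# and anything unresolved triggers a clarification question.
KNOWN_ARTISTS = {"huey lewis and the news", "new order"}

def route(utterance):
    text = utterance.lower().removeprefix("play ").strip()
    if text in KNOWN_ARTISTS:
        return ("music", text)
    if "news" in text:
        return ("news", None)
    # Neither a known artist nor clearly news: ask the user.
    return ("clarify", None)

print(route("Play Huey Lewis and the News"))
```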

I’d also like to start looking at how we can engage in more meaningful dialog ( like Replika.ai ) where members can use the dialog engine to simply hold a conversation.

The way I see it the path to success here ( a more natural conversational agent ) is to use data from opt-in users ( as you said - golden ) and effort from the community ( platinum ) to build a learning loop.

I wrote a blog post on the overall approach a few years ago. Would be awesome to have some folks trying to put it into practice.

1 Like

Hey Joshua,
the premier open example of what you are talking about is surely Common Voice (https://voice.mozilla.org/), which is collecting validated, openly licensed recordings for DeepSpeech (or any other engine). Users can record text snippets and vote on other users’ recordings. Two positives without a negative means inclusion of the recording in the data set.
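That voting rule is simple enough to write down directly; this is a sketch of the policy as described, not Common Voice’s actual implementation:

```python
# "Two positives without a negative means inclusion" -- a direct encoding
# of the validation rule described above.
def clip_accepted(upvotes: int, downvotes: int) -> bool:
    return upvotes >= 2 and downvotes == 0

print(clip_accepted(2, 0), clip_accepted(2, 1))  # True False
```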

Vital for speech recognition, where a huge amount of data is required for good results. The Common Voice team suggest 10,000 hours of validated audio as a target.
With 1500 hours, the current English language model struggles for accuracy with names and less common words.

Possibly less so for a hotword engine, where a small amount of data can be used to train a model for a single speaker, although a general hotword model still takes a lot of data.


I see NLU as a different beast. My understanding is that an NLU model needs hand crafting to avoid overlap and balance accuracy vs features and can only be pushed so wide before the user experience suffers. Domain specific. Less is more. KISS.
Large data sets still help.

Audio to text is a one to one mapping. Converting a sentence into meaning is one to many depending on context. “Tell me more”… about what?

Feeding context with dictated text into an NLU model can help with overlap.

Google and Alexa impose context switching by requiring a skill name after the hotword. OK for voice apps that continuously engage, but “OK Google, ask meeka music to …” for music control is a pain.

RASA uses a secondary machine learning model based on example stories of sequential NLU intents and actions to select between scored possibilities from the NLU model.
Session context is also used in routing stories and decisions. An intent “stop” is interpreted by the news reader rather than the music player because the session remembers that it was the last used skill in the dialog history.

Mycroft and JOVO support explicit switching on and off intents using a state variable held by the skill server session. I recall Dialogflow had something similar.

There is also potential context from

  • speaker identification
  • location tracking
  • my cloud data eg recent searches and messages and contacts
  • a range of hotwords
  • my social network

Maybe I’m pessimistic about the constraints of NLU engines, but I think Mycroft integrators will find they have to be careful with which skills they combine.
I’ve certainly seen plenty of applications start to fall over as I load up the plugins.
Your goal is a broad ecosystem of skills which will inevitably overlap. Curation/certification would help. Skill naming like Google/Alexa is a possibility. (or multiple hotword switching)
Session inclusive context like RASA, combined with other technologies like speaker identification would be helpful.
Lots of examples is probably the best medicine.

I think your suggestion of an openly licensed repository of sentence/context-to-meaning maps would be a great resource for the voice developer community to source initial training data for their application domain.

A challenge is finding a format and tools so that central data can be converted to various NLU training data formats.
It would need to allow for context filters to be attached to training data.
Categorised by application. Music. Search. News. Weather. Joe’s BrainBender Skill.

Curation could flag overlap using additional context filters, with the expectation that users will select their own training data as needed.

Include some tools to assist selecting from the data set, merging training examples with entity data using Chatito and converting to various destination NLU training formats.
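One converter in such a toolchain might look like this: a neutral example record rendered into RASA’s (pre-2.0) markdown NLU training format. The record layout here is invented; a shared repo would need an agreed schema and one converter per destination engine.

```python
# Neutral (text, entities) records -> RASA markdown NLU training format.
def to_rasa_md(intent, examples):
    lines = [f"## intent:{intent}"]
    for text, entities in examples:
        for value, entity in entities:
            # Mark up each entity occurrence as [value](entity_name).
            text = text.replace(value, f"[{value}]({entity})")
        lines.append(f"- {text}")
    return "\n".join(lines)

print(to_rasa_md("play_music", [("play some pop", [("pop", "genre")])]))
```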

From another angle, categorised lists are very useful in generating training data. A model can get by on a few intent examples, but good entity recognition requires significantly more data. Lists of animals, fruits, people names, movie stars, or code for scraping that data would be very useful to improve intent recognition in a restricted selection of domains. RhymeZone shows how WordNet can be used to find related words based on frequency of closeness in an English corpus.

The RASA team have a GitHub repository: https://github.com/RasaHQ/NLU-training-data
Scraping the mycroft community skills repository would certainly be a nice chunk of NLU training data.
Google put up lots of public domain examples for dialog flow.
There’d be heaps of starter material.

There are also lots of folks out there who have shared thoughts and code related to voice development in blogs and repositories.

Tools that encourage capture and collation from live systems, like Rasa X, generate lots of data that can be particularly valuable in providing sentence structures for intents that may not have been considered in the initial generated data.

A website with search tools would be nice but a github repo would do.

README with useful links

skills
  music player
    tools
      scrape
      chatito expand entities
    responses
      - library of spoken responses in ?? format
    examples
      - many files in json format that specify example, NLU parse, related contexts ??
        - allow for entity expansion from scrape sources
  news
  joe's Brainbender Skill
  
tools
  conversion
  scrape
  
lists
  readme with links to open data sources
  famous_people
    - plain text lists of entity values

I’d have some time to feed an NLU sources repository if anyone else is interested in owning and promoting it.

cheers

Steve

2 Likes

You’re right. It is a tough problem. I do believe it can be solved with careful thought and enough effort.

As a rule, I generally start simple and work out from there as I get a deeper understanding of the problem. In this case I think that is essential because the problem is so complex that even framing it gives me a bit of a headache.

I think it would be great to start by simply capturing real-world intents, tagging the intent and object, and dropping them into buckets. Stuff we can’t work out would go in a “TBD” bucket.

We can then use the data to train our basic intents using Padatious ( or RASA if that makes more sense ).

Once we have that working, we move on to a deeper effort?

1 Like