Large Custom Vocabulary


Hi All,

I'm evaluating STT engines for a medical project. I will need to recognize several hundred medicine names, as well as several hundred medical procedures. These words will be spoken by doctors. I understand that I will have to train the STT engine. I'm looking for advice on which of the popular STT engines would handle this best and also works well with the Mycroft project.



DeepSpeech (by Mozilla) or Kaldi are probably the two to look into.

For DeepSpeech, check out the quick tutorial and modify it to fit your own needs. The more high-quality samples, the merrier. Training takes a good chunk of hardware, but it can be done on cloud resources as well. I've fiddled with Kaldi but never done additional training with it; there are examples on the web, of course.

Might be a long shot, but it would be great if you could release the resulting data set under a public domain or Creative Commons license.


Hi there, sounds like an interesting project.

As baconator said, DeepSpeech is a great option to look at, especially if you are in a position to train the engine with data that matches the diversity of your end users. There is a strong community behind it thanks to Mozilla, and Mycroft is very happy to be a part of that.

I haven’t looked into this but I wonder if IBM’s Watson would have a strong existing medical foundation given their focus on the medical space over time?


That’s exactly what I was looking for. A detailed tutorial, and it describes a standard process I understand from ML. Didn’t think to look for those keywords during my Googling.


LOL. I hadn’t played in the IBM world and skipped the Bluemix website, since I hadn’t heard of it. Yes, they specifically mention training for Law and Medicine.

Thanks for the help, this and baconator’s posts have exactly what I was looking for.


It’s an interesting problem - is there even a way to tie in multiple STT engines, or are you talking about extending an existing model by adding additional vocabulary?

I just went through a long list of medicines and medical procedures and tried speaking them to my Android phone. It appears that 80% of them were recognized, though I tended to steer away from terms I didn’t know how to pronounce. It would be a lot easier if the STT engine already recognized enough of your vocabulary.

Once the STT has run, you will still have the problem of matching its output to entities in your database. For example, if the person says “blood count”, that could be matched against “complete blood count” or “red cell blood count” or a number of other terms. Even if they say “complete blood count” it could also mean “complete blood count with differential”. For an application like yours, I suspect that the precision of the match is crucial.
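To make the ambiguity concrete, here is a minimal sketch in plain Python. The term list and the `candidate_terms` helper are made up for illustration; a real project would load terms from its own database and would probably need something smarter than exact word matching.

```python
# Hypothetical term list, for illustration only.
TERMS = [
    "complete blood count",
    "complete blood count with differential",
    "red cell blood count",
    "blood glucose test",
]


def candidate_terms(utterance, terms=TERMS):
    """Return every term that contains all the words in the utterance."""
    spoken = set(utterance.lower().split())
    # A term is a candidate when the spoken words are a subset of its words.
    return [t for t in terms if spoken <= set(t.lower().split())]


matches = candidate_terms("blood count")
if len(matches) == 1:
    print("Unambiguous match:", matches[0])
else:
    print("Ambiguous, ask the user to choose between:", matches)
```

When more than one candidate comes back, the application has to either ask the speaker to disambiguate or rank the candidates somehow.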

I faced a similar problem with a skill that matches against a large vocabulary of song names, artist names, album names, genres, and lyrics from songs (obviously the precision is not as important as in your case). The database consists of maybe 100,000 terms and I wanted to recognize things like “street fighting by the stones” as a match against the song “street fighting man” by the artist “The Rolling Stones”. My past experience in NLP led me to understand that users will seldom give a fully qualified query, and the problem becomes how to resolve partial matches. You end up having to rank the possible results according to their likelihood. This is essentially a search ranking problem (disclaimer: I worked at Google for 12 years, so this has influenced how I think about NLP).
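As a rough illustration of ranking partial matches, here is a toy scorer in plain Python. It is not what my skill actually does; the catalogue, the stopword list, and the scoring rule (the fraction of query words found in the song or artist name) are all assumptions chosen to keep the sketch short.

```python
# Words ignored when comparing a query with catalogue entries (an assumption).
STOPWORDS = {"a", "by", "of", "the"}

# Tiny illustrative catalogue of (song, artist) pairs.
CATALOGUE = [
    ("Street Fighting Man", "The Rolling Stones"),
    ("Street Life", "The Crusaders"),
    ("Kung Fu Fighting", "Carl Douglas"),
    ("Paint It Black", "The Rolling Stones"),
]


def tokens(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def score(query, song, artist):
    """Fraction of the query's words found in the song or artist name."""
    wanted = set(tokens(query))
    if not wanted:
        return 0.0
    available = set(tokens(song)) | set(tokens(artist))
    return len(wanted & available) / len(wanted)


def rank(query, catalogue=CATALOGUE, limit=3):
    """Return up to `limit` (score, song, artist) tuples, best first."""
    scored = [(score(query, s, a), s, a) for s, a in catalogue]
    scored = [row for row in scored if row[0] > 0]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:limit]


for value, song, artist in rank("street fighting by the stones"):
    print(f"{value:.2f}  {song} / {artist}")
```

With 100,000 terms you would not want to scan the whole catalogue on every query like this, which is one reason to hand the job to a proper search index.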

To solve my problem, I ended up using the Xapian text search library. It has Python bindings, so you can both build the index and query it using Python from a skill. Unfortunately the library cannot be installed with pip, so I haven’t been able to distribute my skill to others; it has to be installed by hand.


Thanks for this info. I will research Xapian, as I might have use for it in another project as well.