Languages are Hard

Originally published at: https://mycroft.ai/blog/languages-are-hard/

Comprenez-vous? ¿Entiende usted?
Você entende? Apakah anda mengerti?


Voice interfaces hold so much promise because of their ease of use. But a truly intelligent voice agent should be able to listen and speak to you in your own language, right? Many are excited about the possibility of using the Mycroft open voice assistant with their native tongue.

Currently, Mycroft only officially works in English, with Community-driven efforts underway to support French, German, Italian, Portuguese, Spanish and Swedish. The exciting thing about the open source, Community-based approach of Mycroft is that translation can be done by more than just the individuals writing Skills. The users are empowered to band together and start support for their own language!

We’ve made a start by providing some initial documentation if you want to experiment with language support. Here, we break down how foreign language support must be implemented at each layer of the Voice Stack, and provide an overview of our language support roadmap.

How do languages need to be supported across the voice stack?

In order for foreign language support to be useful, it needs to work across the entire Voice Stack. The voice stack is the combination of software components which, just like a layer cake, stack together to provide a voice service. Let’s take a look at them:

Wake Word

The Wake Word is the layer that tells the voice assistant to ‘wake up and start listening for commands’. It’s sometimes called a hot word. By default, the Wake Word on Mycroft Devices is ‘Hey Mycroft’. Initially Mycroft used PocketSphinx for Wake Words, but moved to the Precise Wake Word engine last fall.

PocketSphinx and Precise work in different ways. PocketSphinx maps phonemes - think of these as sound building blocks - to graphemes, which are word building blocks. This way, PocketSphinx knows that the phoneme sequence HH EY . M AY K R AO F T matches the words “Hey Mycroft”. To learn more about the differences between phonemes and graphemes, this blog post is a great start.

In contrast, Precise works using a neural network. Learning from tagged samples of Mycroft users who have opted in to our open dataset, Precise is able to build an accurate model of the Hey Mycroft Wake Word. Building a model using samples from a wide variety of genders, accents and tones means that a man with a French accent saying Hey Mycroft will be recognized just as well as an Australian woman, even though the sounds of these accents differ significantly. The downside of this approach, of course, is that the model needs a large, diverse dataset for training.

If you wish to use a Wake Word in a language other than English right now, you have two choices, each with benefits and drawbacks. The first is to set a custom Wake Word in your home.mycroft.ai account using the phonemes available in the English PocketSphinx dictionary. You won’t be able to use phonemes that don’t occur in English - for example the “sch” sound in German, or the “xo” sound in Catalan. The second is to install a PocketSphinx dictionary in your chosen language, which is not for the faint-hearted and requires advanced Linux skills.
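
For reference, a custom PocketSphinx Wake Word is configured roughly along these lines in mycroft.conf - the wake word, phoneme string and threshold below are only illustrative, and the exact keys may differ between releases:

```json
{
  "listener": {
    "wake_word": "hey computer"
  },
  "hotwords": {
    "hey computer": {
      "module": "pocketsphinx",
      "phonemes": "HH EY . K AH M P Y UW T ER",
      "threshold": 1e-90
    }
  }
}
```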

Speech to Text

The Speech to Text (STT) layer is the part of the Voice Stack that transcribes what you say to the Mycroft device. Currently, Mycroft defaults to the excellent STT engine from Google, but anonymizes all the requests so any traffic is just seen by Google as ‘Mycroft’. Google STT supports several languages other than English, so if you speak a supported language you can edit your mycroft.conf file to try it.
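
As a rough sketch, pointing a device at German might look like this in mycroft.conf - treat the key names as illustrative, since they can change between releases:

```json
{
  "lang": "de-de",
  "stt": {
    "module": "google"
  }
}
```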

We intend to move to DeepSpeech as our default STT layer in the future, and you can try DeepSpeech on Mycroft now. This offers many more options, including the ability to host DeepSpeech on your own server or, eventually, directly on your device. However, this is still young technology - currently at version 0.3 - and the only trained model available so far is for English. We are working with Mozilla to expand its language range and to generalize the process so it can support every language!

Intent Parser

Once the Speech to Text layer has turned spoken words into text, we call that text an Utterance. The Utterance is then run through our Intent Parser layer. The role of an Intent Parser within the Voice Stack is to match an Utterance with the intended action in a specific Skill - that is, to find the “Intent” of the user.

In the Mycroft Voice Stack, there are two different Intent parsing phases:

  • Adapt: Keywords from the vocab files and patterns from the regex files of the Skills are combined into Intent rules and used to find text matches. This generates a confidence score for any matching Intent, and flow of control is passed to the Skill with the highest confidence score.
  • Padatious: If Adapt can’t parse the Intent, a neural network determines the confidence score based on Intent examples provided by Skills. Flow of control is again passed to the Skill with the highest confidence score.

If neither Intent Parser finds a match, flow of control is passed to a Fallback Skill, such as the Wolfram|Alpha Skill, to handle the Utterance.
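
To make this concrete, here is a minimal sketch of a Skill registering one Intent with each parser. The skill name, vocab/intent file names and dialog name are all hypothetical, and the decorator-based API shown reflects the mycroft-core Skill interface at the time of writing - check the Skill documentation for your version:

```python
from adapt.intent import IntentBuilder
from mycroft import MycroftSkill, intent_handler, intent_file_handler


class WeatherDemoSkill(MycroftSkill):
    # Adapt: the rule is built from keyword (.voc) and regex (.rx) files
    @intent_handler(IntentBuilder('CurrentWeather')
                    .require('Weather')        # vocab/<lang>/Weather.voc
                    .optionally('Location'))   # regex/<lang>/Location.rx
    def handle_current_weather(self, message):
        location = message.data.get('Location', 'your location')
        self.speak_dialog('current.weather', {'location': location})

    # Padatious: trained from example sentences in an .intent file
    @intent_file_handler('what.is.the.weather.intent')
    def handle_weather_examples(self, message):
        location = message.data.get('location', 'your location')
        self.speak_dialog('current.weather', {'location': location})


def create_skill():
    return WeatherDemoSkill()
```

Each language the Skill supports then needs its own Weather.voc, Location.rx, what.is.the.weather.intent and current.weather.dialog files - which is exactly what the Skills section below walks through.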

Before the Utterance is run through the intent parsers, a language-specific normalization occurs. Normalization cleans up the transcription, doing things like converting contractions to their expanded form (e.g. “What’s the weather like” becomes “What is the weather like”). This code must be added to mycroft-core itself for each new language. For example, normalization in Portuguese, which distinguishes between masculine and feminine forms of a word, would need to account for both masculine and feminine phrases.
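
mycroft-core exposes this step as a helper; a quick sketch of what the English normalizer does (note that, by default, it also strips articles such as “the”):

```python
from mycroft.util.parse import normalize

# Contractions are expanded before the Utterance reaches the Intent Parsers.
print(normalize("what's the weather like", lang="en-us",
                remove_articles=False))
# -> "what is the weather like"
```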

Skills

In the Voice Stack, the role of the Skill is to do the ‘heavy lifting’ and provide the user with the outcome they wanted - such as reporting the news or weather, or playing a piece of music.

The Mycroft skills system has supported multiple languages from the beginning. To support a new language, each Skill must translate three different pieces:

Vocabulary

Independent directories within the Skill hold vocab for the various language codes. For example, a Skill written originally in English will have several files like vocab/en-us/Word.voc, with the English language pieces in the *.voc files. Adding German support involves creating vocab/de-de/Word.voc files holding the German version of the same words.
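
As an illustration, a hypothetical weather Skill supporting English and German might be laid out like this (the file names are made up for the example):

```
weather-demo-skill/
├── vocab/
│   ├── en-us/
│   │   └── Weather.voc
│   └── de-de/
│       └── Weather.voc
├── regex/
│   ├── en-us/
│   │   └── Location.rx
│   └── de-de/
│       └── Location.rx
└── dialog/
    ├── en-us/
    │   └── current.weather.dialog
    └── de-de/
        └── current.weather.dialog
```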

A Skill might also use regular expressions in parsing, contained in its *.rx files. For a regex pattern to match, not only do the words themselves differ between languages, but the phrasing and placement of words change as well.

For example, let’s take the phrase “How’s the weather today”. In most European languages, the phrasing follows the structure “question - keyword - day”. However, in Turkish, note the two phrases:

  • “Bugün hava nasıl?” - “How’s the weather today?” (Literally, “Today, how’s the weather?”)
  • “Hava nasıl?” - “How’s the weather?”
The structure is “day - question - keyword”. This means that not only would regex files need to be rewritten to support Turkish, but the structure of the expressions needs to be changed as well. This process can be complicated further by languages which classify objects as masculine and feminine, because more regular expressions are required to cover all the cases needed to correctly identify an Intent.
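
To illustrate with hypothetical file names and patterns: an English regex file such as regex/en-us/weather.day.rx might contain a line like

```
how is the weather (?P<Day>today|tomorrow)
```

while its Turkish counterpart, regex/tr-tr/weather.day.rx, needs the named group moved to the front:

```
(?P<Day>bugün|yarın) hava nasıl
```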

Dialog

Most Skills have lines of text that are spoken when the Skill completes a task or when information is returned through an API. These lines live in Dialog files, which need to be translated for each new language.
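
For example (with hypothetical file contents), dialog/en-us/current.weather.dialog might hold a few variations of the same response, from which Mycroft picks one at random:

```
Right now it is {condition} and {temperature} degrees in {location}
It is currently {condition} and {temperature} degrees in {location}
```

and dialog/de-de/current.weather.dialog would carry the German equivalents:

```
Im Moment ist es in {location} {condition} bei {temperature} Grad
```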

Skill internals

Within some Skills, extra conditional processing may be required to handle new languages - for example, translating phrases that come back from an API in English into the target language, such as converting “cloudy” to “bewölkt” in German.
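
A minimal sketch of what that might look like inside a Skill, assuming a hypothetical lookup table and helper (self.lang is the language code configured on the device):

```python
# The API always answers in English, so other languages need an
# extra translation step before the result is spoken.
CONDITION_TRANSLATIONS = {
    "de-de": {"cloudy": "bewölkt", "sunny": "sonnig", "rain": "Regen"},
    "pt-br": {"cloudy": "nublado", "sunny": "ensolarado", "rain": "chuva"},
}


def localize_condition(condition, lang):
    """Return the condition in the device language, falling back to English."""
    return CONDITION_TRANSLATIONS.get(lang, {}).get(condition.lower(), condition)
```

Inside the Skill, the localized result would then be handed to the dialog, for example self.speak_dialog('current.weather', {'condition': localize_condition(api_condition, self.lang)}).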

Text to Speech

A Skill will normally complete execution by speaking a line of Dialog to the user - like saying “the weather in Geelong today is clear skies and 22 degrees Celsius”. This is the Text to Speech layer of the Voice Stack, and its role is just that - to speak written information.

The default TTS engine used in Mycroft is Mimic. Mimic is currently available only in English, so if Mimic tries to speak foreign words, or words with diacritical marks (such as the ö sound in Swedish), the pronunciation will be unnatural.

Mycroft, being modular, allows you to select other TTS engines. The Google TTS engine has more language options available. Again, if you want to configure this for your language, you need to edit your mycroft.conf file.
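
As with STT, this is a small edit to mycroft.conf; a sketch for switching the TTS module to Google (the exact keys may vary between releases, and the device "lang" setting shown in the STT sketch above determines which language the voice uses):

```json
{
  "tts": {
    "module": "google"
  }
}
```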

Building a new Text to Speech engine like Mimic is very difficult, requiring an expert-level understanding of the language to build the phonetic mappings, plus generating the voice pieces for the synthesis.

What does the language roadmap look like?

As you can see, providing language support is no easy task. We are continually improving, and the following steps are part of building the tools to officially support more languages.

Training other Wake Words in Precise

Once the current ‘Hey Mycroft’ Wake Word in Precise has an accurate model, we will be opening up the Precise Tagger to allow tagging of other Wake Words, including Wake Words in other languages. Since Precise itself is trained directly from recordings, it is already multi-language ready.

Moving to DeepSpeech for STT

Estimates say DeepSpeech requires 10,000 hours of tagged samples to provide a workable STT model for a language. The English dataset is still being built, as is the machine learning code that runs DeepSpeech. Gathering 10,000 hours would be a huge amount of work for any individual, but spread over many collaborators it is a much more manageable task. We are creating the tools to collect and tag these training datasets as a community.

Adoption of Python 2.7 => Python 3 for Mycroft Core and all Skills

Mycroft began with Python 2.7, which does not handle Unicode text by default. Unicode is needed to represent text in languages which do not use the Latin alphabet (the alphabet used to write English). With the recent transition to Python 3, this hurdle has been cleared.
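
A quick illustration of why this matters - in Python 3, ordinary strings are Unicode, so accented and non-Latin text can be handled without any special decoding:

```python
# Python 3: str is Unicode by default, so text outside the
# Latin alphabet needs no explicit encoding or decoding.
greeting = "Bugün hava nasıl?"   # Turkish: "How's the weather today?"
print(len(greeting))             # counts characters, not bytes
print(greeting.upper())          # BUGÜN HAVA NASIL?
```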

Mimic 2 for TTS

We are putting resources into a new method for Text to Speech. Mimic 2 is based on a neural network and as such is “trained” rather than “programmed”. We estimate that it will take around 20 hours of recorded speech to yield a reasonable language model in nearly any language. These recordings need to be clear and of one single voice in order to accurately train the model, but this is still much easier than getting a Ph.D. in order to be able to build support in the original Mimic.

The first English language voice is being created now, while we work out the kinks in the recording process. Recommendations and tools will soon be available for the Community to build their own voices.

Harvesting tools for vocab and dialog files within Skills

Over the next few months, we're creating a harvesting tool to identify all the vocab and dialog files within Mycroft Core and Skills that need to be translated. This will make it easy to see what needs to be done to support each language, and make it easier to keep up as new Skills are created and old ones change. No programming skill will be required to help bring Mycroft to your favorite language.

What can I do right now?

So what can you do right now to help advance language efforts?
  • Help with Precise Tagging: The sooner the Hey Mycroft Wake Word model is well trained, the sooner we can move on to training other Wake Words. You can tag Precise samples at home.mycroft.ai under "Tagging."
  • Opt-In to our Open Dataset: So that we can gather a diverse set of spoken samples with many voice types and accents, we need lots of people contributing to the Open Dataset.
  • Have web-dev skills? Contact us! We’d love help with the harvesting project and associated web interfaces.
  • Be patient! As you can see, multi-language support isn’t easy. But the Mycroft Community has the best potential to support not just the most profitable languages, but all languages.

As for DeepSpeech tagging:
Does “Beeping at the start and background noise are fine” mean the beeping has to be audible in the recorded sample?
Or will a sample be fine that has no beeping but otherwise clear audio and meets the first requirement “Text matches exactly what is spoken”?

Hi @dottedfish, thanks for the question. What if we re-worded it to:

Beeping at the start is allowable for a 'Yes', and background noise is allowable for a 'Yes', however the transcription must _exactly_ match the spoken words. Homonyms - words that are spelt differently, but sound the same - should be marked 'No' if the wrong homonym is used. For example, "The square root of pi" is a 'No', but the phrase "Please order an apple pie" would be allowable as a 'Yes' if it matched.

The wording for this is a bit awkward. Maybe we need more examples of what is ‘Yes’ and ‘No’.

cc my colleague @Mn0491

I lost you at the pi example to be honest. :slight_smile: Not sure if the sentences provided really fall under homonyms - but it is a concern indeed. The wiki has a clear example: fluke
A fish, and a flatworm.
The end parts of an anchor.
The fins on a whale’s tail.
A stroke of luck.

However, I doubt that this example will actually appear. The speech to text engine would just interpret the audio and translate it to text without caring about the meaning, right? But anyhow, I included your suggestion.

That said, I’m not a native speaker, but here are my humble suggestions / what I’d like to see or what would be clearer to me:
What is wanted / expected for a vote category.
What is not allowed in a vote category.
If applicable / common: What is tolerated.

So I’d rephrase and clear things up a little, for example:

We want:
User directed inputs

We do not want:
Inputs from radio, television or music as well as random family conversations or background talk.

YES
Text matches exactly what is spoken
Beeping at the start and background noise are fine

  • Text must match exactly what is spoken.
  • Can not contain background sounds (for example TV).
  • Slight background noise or buzzing is fine.
  • Beeping at the start is fine but not required.

NO
Text is slightly different than what is spoken
Text is completely different than what is spoken
Multiple voices / background voices
If uncertain, just pick ‘No’

  • Text differs even slightly from what is spoken, including homonyms
  • Text is completely different than what is spoken
  • Multiple voices / background voices
  • Audio is matching the text but coming from background sound sources like music, television or similar

SKIP
Unsure about a word in the audio

  • Unsure about an applicable category.
  • If uncertain, just pick SKIP

I see no reason for the SKIP option if NO already states: “If uncertain, just pick ‘No’”. So maybe the suggestion above captures the intended logic.

Thanks for the suggestions on clarification. We will take this into account.
