Why we're moving to DeepSpeech on March 31 | Privacy, Speech to Text & Balance

steve.penrod · January 11, 2018, 10:06pm

Originally published at: Why We’re moving to DeepSpeech on March 31 | Privacy, Speech to Text & Balance - Mycroft

Our community asked, we’re answering. Read notes from our CTO, Steve Penrod about our move to DeepSpeech.

From Steve:

Misunderstanding of how Mycroft performs Speech to Text is one of the things I hear about regularly. So today I’ll provide some clarity on how it works now, why it works that way, and where we are heading with this technology in the future.

Mycroft Design Decisions

It is rare that there is only one way to solve a problem. So every project requires thoughtful analysis of the end user's needs to select the best problem solving approach. With careful thought, experimentation and planning it is usually possible to work out a balanced approach to solve the problem. One of the key factors in choosing a solution is the design criteria. What does the user value? How does the development team best provide that value? What is the top priority? What can wait? When we started Mycroft our design criteria consisted of:

Choice

One of the key things we wanted to enable is choice. We knew that the technology we were using two years ago would likely not be the same technology we'd use in five years. Similarly, we knew that the technology that we decided to use would not be the technology that everyone would want to use. So we architected our system to allow choice — by both the project and by individual users — for the key technical components: Speech to Text, Text to Speech and Skills.

STT: Performance

Next we knew that the only way this would be a successful project is with a good Speech to Text (STT) engine. Being 90% accurate sounds pretty good. 90% is an A, right? But 90% isn't always acceptable:

“Please turn on the kitchen right”

That is close, but no cigar, if what you said was:

“Please turn on the kitchen light”

Errors like that cause frustration and quickly lead users to set aside a technology, so we wanted our software stack to default to the best experience possible.

Simplicity

Another thing we wanted was simplicity. We didn't want to make a user spend 30 minutes setting up defaults. I know I've looked at projects that might have been cool, but the setup was just too much of a hassle to even try. Some systems only run on specific hardware which, if the user doesn't have it handy, makes the technology unusable. Others require setting up several accounts with other systems they depend on. Every setup step and requirement increases friction and shrinks the audience, and we wanted to make Mycroft available for everyone, so simplicity was key.

Privacy

We knew from the start that voice technology was uniquely intimate. To be useful, it has to be available all the time. That means it has to listen constantly, to every single word you say. It should ignore most of those words, but it still has to listen.

The concept of ‘wake words’ helps. I can (usually) feel good that strangers aren’t able to listen to me without my knowledge. But someone can still glean intimate details by aggregating all of the queries and commands I speak.

“Where can I buy clown makeup?”
“Are there any clown colleges?”
“When is the next clown convention in Kansas City?”

Maybe I don’t mind the world knowing that I have this interest in clowns. But then again maybe I don’t want my clown addiction exposed when I run for a political office a decade from now. Regardless, I shouldn’t be forced to give up all my privacy just to use voice technology.

Balance

Weighing all these factors, we made some key decisions early on.

We decided to set up a Mycroft STT service on our servers which would provide transcription for our users. This approach allowed us to choose the best STT technology to run that service. Our initial implementation has been to use the Google Speech To Text service, which was by far the most accurate at the time and remains so today. Others were in the 70-80% accuracy range, but Google was already at 90%+. Thus, we give Mycroft users the best performance.

To provide privacy we decided to aggregate all of our speech to text requests into a single bucket and set the source to “MycroftAI”, not the individual end user. So in the example above, my clown requests would all be blended in with the requests of hundreds of others. Since there is no easy way to connect my request to me, this provides privacy.
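As a rough illustration of the bucket idea, here is a minimal Python sketch. It is not Mycroft's server code: the `SttRequest` type, the `anonymize_batch` function and the dictionary field names are hypothetical, and only the shared source label "MycroftAI" comes from the post.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SttRequest:
    """Hypothetical shape of an incoming transcription request."""
    user_id: str  # identifies the device or account that sent the audio
    audio: bytes  # raw audio to be transcribed


def anonymize_batch(requests, source="MycroftAI", rng=None):
    """Drop per-user identifiers, label every request with one shared
    source, and shuffle so ordering does not hint at who sent what."""
    rng = rng or random.Random()
    outgoing = [{"source": source, "audio": r.audio} for r in requests]
    rng.shuffle(outgoing)
    return outgoing


if __name__ == "__main__":
    batch = [
        SttRequest("alice", b"clown makeup audio"),
        SttRequest("bob", b"kitchen light audio"),
    ]
    for item in anonymize_batch(batch):
        print(item["source"], len(item["audio"]), "bytes")
```

Note that this only removes the link between a request and an account. Anything identifying that is spoken in the audio itself is still passed along.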

Finally, we decided to provide an easy mechanism for users to switch to other STT engines, completely eliminating Mycroft from the process if they want. That provides choice.

This was the best balance.

Maintaining Balance

Balance is not a static thing. Things change. New forces come into play. You have to remain aware and shift to keep your balance.

There are several new factors in play today that weren’t there two years ago.

New STT technology
More powerful hardware
Better awareness of privacy concerns

New STT: DeepSpeech

Since this summer we have been working with the Mozilla Machine Learning team. They created a new, open source, machine learning-based STT technology called DeepSpeech built on research started at Baidu. We've assisted with Project Common Voice and are creating a new mechanism allowing Mycroft users to participate in building the Open Dataset to provide more real-world data for use in training to improve the system.

At the beginning of the summer the word-error-rate for DeepSpeech was at around 15%. By the fall it was at 10% and it is continuing to improve as more training data is digested. This is now in the accuracy realm needed for a voice assistant.
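For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between what was said and what was transcribed, divided by the number of words said. The sketch below shows that calculation. The function name `word_error_rate` is made up for this example, and it is not necessarily the scoring script the DeepSpeech team uses.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    said = "please turn on the kitchen light"
    heard = "please turn on the kitchen right"
    print(f"WER: {word_error_rate(said, heard):.1%}")
```

In the kitchen light example from earlier in the post, one wrong word out of six gives a word error rate of about 17%, even though the sentence looks almost right.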

For those with the know-how and resources, you can already set up and use DeepSpeech on your own high-end equipment today.

More Powerful Hardware: GPUs and TPUs

Two years ago a Graphics Processing Unit (GPU) was an expensive accessory needed only for the latest 3D shooter or to drive the new VR toy called an Oculus Rift.

Today, GPUs are being used to mine cryptocurrencies, to power self-driving cars and, yes, to accelerate STT. DeepSpeech on a simple CPU can run at 140% of real time, meaning it can’t keep up with human speech. But with a good GPU it can run at 33% of real time.
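The percentages above read as processing time divided by audio duration, so anything over 100% falls behind live speech. A small sketch of that arithmetic follows. The function names and the 10-second clip are illustrative assumptions; only the 140% and 33% figures come from the post.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time as a fraction of the audio's duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up(rtf):
    """True when transcription finishes no slower than speech arrives."""
    return rtf <= 1.0


if __name__ == "__main__":
    clip_seconds = 10.0  # illustrative clip length, not from the post
    for label, rtf in (("CPU", 1.40), ("GPU", 0.33)):
        needed = rtf * clip_seconds
        verdict = "keeps up" if keeps_up(rtf) else "falls behind"
        print(f"{label}: {needed:.1f}s to transcribe {clip_seconds:.0f}s of audio ({verdict})")
```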

Tensor Processing Units (TPUs) are just emerging and promise even higher speeds for TensorFlow systems.

Privacy Awareness

Now that Amazon's Echo has become common, people are really starting to think about what it means to have an always-on microphone in your home. What are the incentives and motives for these services? What do you really know about how they work?

A new balance

These shifting forces have led us to the new balance, appropriate for today. We are changing our default Mycroft STT engine to DeepSpeech. This will happen on March 31st. This means that none of our users' queries will leave the Mycroft perimeter unless the user is accessing an outside service (weather, for example), in which case the source of the query will be "MycroftAI" and not an individual user.

That said, we will continue to allow Mycroft users to choose their speech to text engine. Users can easily select Google, Watson, or Wit.ai as an STT provider (or Kaldi, Bing, Houndify or more if you are willing to get your fingers dirty).

This is the new best balance.

Your Privacy Going Forward

Here at Mycroft we take privacy seriously. We don't preserve data unless we are given explicit permission. We don't sell data to third parties and we don't intrude on the lives of our customers. We aren't trying to sell you products, dominate online advertising or own your digital identity. Switching to DeepSpeech advances our goal of providing our users with the highest quality, most private experience possible. We're proud of our commitment to privacy and look forward to bringing it to more and more of the Internet over time.

KathyReid · January 12, 2018, 5:56am

Excellent write up, thanks!

oren · January 13, 2018, 7:33am

Can you elaborate on this? Is the STT, skill execution, and TTS all done on my laptop (If I have the Mycroft software installed) or on the device (Mark 1/2)?

J_Montgomery_Mycroft · January 14, 2018, 5:49am

Right now the only part of the interaction done off site is STT. We do, however, manage settings via the https://home.mycroft.ai servers. This allows users to aggregate settings across devices (Pandora credentials, for example).

Today we aggregate all of the user queries into a single bucket, shake the bucket and send the queries on to Google. The Google servers have no way to know that you are the origin of the query, however, they could get personal information if you were to say “Hey, Mycroft - My name is Inigo Montoya, you killed my father, prepare to die”. Google would then know that you are the swordsman looking to kill the six fingered man.

The shift to DeepSpeech will allow us to run STT on our servers without ever touching a third party. In fact, we are moving our server infrastructure to a private data center some time in the next month or two so that we own the entire user interaction including the actual iron running the virtualization cluster. Since I own the data center through Wicked Broadband we actually own everything including the fiber infrastructure all the way to Hurricane Electric’s Internet core.

Without a GPU on the Mark I device we can’t perform faster than real time speech to text transcription so we have to run the service in the cloud. We are going to launch a new version of Mimic soon based on Tacotron which also needs a GPU to run faster than real time. This will also run on our infrastructure.

So, for now, Mycroft is still tethered to Internet services…but there is hope!

I’ve been giving a lot of thought to the privacy implications of Mycroft’s technology and I am a strong believer that Mycroft users will benefit from the ability to run the entire user interaction on-premises. This is true for individual users, but especially true for our larger corporate customers who want to maintain data independence.

We’ve already confirmed that we can run the Mycroft back-end stand-alone. We did this to make sure that when we start deploying our first major corporate engagement (announcement to follow in the next few months) we will be able to move the entire stack on premises.

When we go to production in February 2019 with version 19.02 I’m going to advocate strongly for a desktop/server agent that can run in an individual user’s home. This agent may run on a desktop or a dedicated “Mycroft Server” that connects to your router. The service will make use of the computer’s GPU to perform Speech-To-Text transcription and Text-To-Speech synthesis using DeepSpeech and Mimic II respectively.

That said, I don’t control the development roadmap. After the 18.02b deployment on February 28 we will be building the roadmap in close consultation with our community. As the CEO I can advocate for features like this, but ultimately it is up to the community of Mycroft users.

malevolent · January 16, 2018, 11:38am

Thank you very much. Privacy matters, and if we switch to DeepSpeech we can contribute to another open source project, which I think is a win for everyone. I can sacrifice a bit of usability for a while if that means privacy for me and mine, and helps to improve an open source project (or two).
So, waiting for March 31st to switch the engine!

KathyReid · January 16, 2018, 12:12pm

We agree @malevolent - so many benefits in going down this direction.

HenryMiller1 · January 16, 2018, 9:17pm

What we want depends on how well DeepSpeech works in the real world. If it is just barely acceptable in the best cases, then putting more resources into making it better, at the expense of privacy, is important. I don’t want to trust you, but I don’t want to trust Google, Amazon, or anyone else either. However, getting the basics right is needed before I can ask for privacy.

Make it work, make it right, make it fast. Right now voice AI is just barely in the “it works” category.

steve.penrod · January 19, 2018, 6:51pm

I’m completely with you, @HenryMiller1 – if the base STT technology doesn’t work at 90+% accuracy, the voice agent is just an irritant. I started attempting to build this sort of system back in the late 90s, but the limitations of STT at that time made it a non-starter. We have been able to bootstrap the system with effective techniques that have required some compromises so far. Now we need to carefully transition and be assured that maintaining the base usability is a prime requirement for us.

HenryMiller1 · January 19, 2018, 7:46pm

What I’m saying is if DeepSpeech is barely at 90% I’d rather your time go to making that better using your servers. If DeepSpeech is 99.999% then I want to run it all locally for privacy reasons.

steve.penrod · January 19, 2018, 8:01pm

You’ll be able to easily switch back and forth, so you can evaluate it yourself and decide what works for you.

Elleo · January 21, 2018, 10:29pm

A couple of weeks ago I created a GStreamer plugin for Deep Speech which you might find useful:

I also wrote a quick example showing how simple it is to create a python application that records audio, converts it into a suitable format and prints out any recognised speech:

github.com/Elleo/gst-deepspeech/blob/master/examples/python/print_speech.py

#!/usr/bin/env python

from __future__ import print_function

import gi
gi.require_version('Gst', '1.0')
from gi.repository import GObject, Gst


def bus_message(bus, message):
    structure = message.get_structure()
    if structure and structure.get_name() == "deepspeech":
        text = structure.get_value("text")
        print(text)
    return True


if __name__ == "__main__":
    GObject.threads_init()
    Gst.init(None)

This file has been truncated.

It performs automatic audio segmentation based on silence thresholds so it can process audio in small chunks as it’s spoken, allowing for continuous dictation where needed too.
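To give a feel for what segmenting on a silence threshold means, here is a deliberately simplified pure-Python sketch. It is not the plugin's implementation: the function name `segment_on_silence`, the threshold and the run length are hypothetical, and a real implementation would more likely work on frames of audio than on individual samples.

```python
def segment_on_silence(samples, threshold=500, min_silence=3):
    """Split PCM sample values into spoken segments.

    A segment ends once `min_silence` consecutive samples are at or below
    `threshold` in absolute amplitude. Returns (start, end) index pairs
    with `end` exclusive and trailing quiet samples trimmed off.
    """
    segments = []
    start = None   # index where the current segment began, if any
    quiet_run = 0  # consecutive quiet samples seen inside a segment

    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence:
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0

    # Close a segment that was still open when the audio ended.
    if start is not None:
        segments.append((start, len(samples) - quiet_run))
    return segments


if __name__ == "__main__":
    audio = [0, 0, 900, 800, 0, 700, 0, 0, 0, 0, 600, 650, 0]
    for start, end in segment_on_silence(audio):
        print(start, end, audio[start:end])
```

Each returned slice could then be handed to the recogniser as a short clip, which is what allows dictation to proceed chunk by chunk.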

KathyReid · January 24, 2018, 12:47am

Thanks, that looks awesome. How heavy did you find DeepSpeech on CPU/GPU?

Elleo · January 24, 2018, 4:25pm

On my Thinkpad using just the CPU (Skylake i7) it’s slightly slower than real time; performance is better with an old GTX 980M GPU using CUDA. However, since DeepSpeech currently only takes complete audio clips, the perceived speed to the user is a lot slower than it would be if it were possible to stream audio to it (like Kaldi supports) rather than segmenting it and sending short clips. The total time ends up being the time taken to speak and record plus the time taken to perform inference, instead of the two happening simultaneously.
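The difference can be put into rough numbers. The sketch below is an idealised model, not a measurement: it assumes inference time scales linearly with audio length (a fixed real-time factor), that streamed audio arrives in equal chunks, and that there is no network or startup overhead. The function names and the example figures are made up for illustration.

```python
def clip_wait(speech_seconds, rtf):
    """Wait after the user stops talking when the whole clip is sent at
    once: inference only starts when recording ends."""
    return rtf * speech_seconds


def streaming_wait(speech_seconds, rtf, chunk_seconds=0.5):
    """Wait after the user stops talking when audio is streamed in chunks.

    If the engine keeps up (rtf <= 1) only the final chunk is left to
    process. If it cannot keep up, the backlog built up while speaking
    also has to be worked through.
    """
    last_chunk = rtf * chunk_seconds
    backlog = (rtf - 1.0) * speech_seconds + chunk_seconds
    return max(last_chunk, backlog)


if __name__ == "__main__":
    speech = 4.0  # seconds of speech, illustrative
    for rtf in (0.5, 1.5):
        print(f"rtf={rtf}: clip {clip_wait(speech, rtf):.2f}s, "
              f"streaming {streaming_wait(speech, rtf):.2f}s")
```

Under these assumptions streaming helps in both cases, but most of all when the engine is faster than real time, because almost all of the inference then overlaps with the speech itself.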

KathyReid · January 25, 2018, 5:50am

Super helpful, thx @Elleo

abuvaneswari · April 5, 2018, 4:22pm

Hello @steve.penrod, @KathyReid,

Have you moved to DeepSpeech server as per the original plan? The STT performance continues to be near-perfect, even after March 31st, 2018!

Can you please list the training corpora that you used for training Mozilla’s DeepSpeech model?

I am asking this because I did not get the near-perfect performance when I utilized their officially released pre-trained model in my local deepspeech-server. I assume you trained the DS model further with some additional dataset. Wondering what that dataset is.

Thank you.

KathyReid · April 5, 2018, 4:38pm

Hey there @abuvaneswari, great question.

We have made the option to configure Mycroft STT to use DeepSpeech available, with more information available here: https://mycroft.ai/blog/deepspeech-update/

Right now we are not training using a custom data corpus; we are using the publicly available models. The DeepSpeech public models are not yet as accurate as other STT engines - which explains the experience you’ve been having.

We’re now focussing on how to build the DeepSpeech dataset and how to label it to help improve accuracy.

Kind regards,
Kathy

abuvaneswari · April 5, 2018, 5:52pm

@KathyReid, thanks a lot for your quick response. I went thru the blog post on deepspeech-update and appreciate your effort in making the powerful servers available for DS inferencing.

So, I suppose you have not made the switch to DS as the default engine yet?
Please confirm.

I am working on a similar effort, training DeepSpeech so that the sentence recognition rate can be improved in a conversational / interactive scenario. I am making slow, but sure progress; will be glad to share my experiences, datasets and results when I am close to my target.

KathyReid · April 6, 2018, 9:58am

Thanks so much for your kind offer, @abuvaneswari. I can confirm that DeepSpeech is not the default STT engine yet, primarily because of the accuracy of the speech recognition - which of course, is improving every day.

We’d really appreciate your thoughts in our Chat channel for machine learning at:
https://chat.mycroft.ai/community/channels/machine-learning

Kind regards,
Kathy