Some questions addressing the technical function of Mycroft

Hey there,

I’m currently writing my bachelor thesis about digital voice assistants. I’m also taking a deeper look into Mycroft (because I care about privacy and really like your idea of an open-source community project).

However, I still have some questions I can’t answer myself, so I thought I’d ask them here:

Which functions does Mycroft run on the device, and which in the cloud?
As far as I understood from an article by a Mycroft employee (Securing privacy with Mycroft, an Open AI voice assistant | Opensource.com), intent parsing, skills, and text-to-speech (TTS) are done locally on the device. However, in the same article it’s written that after wake word detection, the command is recorded and sent to the cloud for STT.
The transcribed text is sent back to the device.
Then NLP, skill handling and TTS are carried out.

So, summed up, does that mean the following: wake word spotting, as well as recording the user’s commands, is done locally and offline.
Then STT is done in the cloud via Mycroft Home, which uses e.g. Mozilla DeepSpeech or Google STT, and the text is sent back to the device, where intent parsing = NLP (meaning NLU and NLG) is done.
The result of the NLG is output using TTS, and the intent of the user’s command (e.g. starting a skill, or doing some smart home automation using the Home API once again?) is executed.
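Or, as a toy sketch of my current mental model (every function below is just a stand-in I made up, not actual Mycroft code):

```python
# Toy sketch of the flow as I currently understand it.
# Every function here is a placeholder I invented, not real Mycroft code.

def detect_wake_word() -> None:
    """Runs locally on the device, listening for 'Hey Mycroft'."""

def record_command() -> bytes:
    """Records the user's utterance locally once the wake word fires."""
    return b"fake-audio"

def cloud_stt(audio: bytes) -> str:
    """Audio goes through Mycroft Home to e.g. Google STT; text comes back."""
    return "what's the weather like"

def handle_on_device(text: str) -> str:
    """Intent parsing (NLU) and the matching skill both run on the device."""
    return "It is sunny."

def speak(reply: str) -> None:
    """TTS (Mimic) turns the reply back into audio, again on the device."""
    print(reply)

if __name__ == "__main__":
    detect_wake_word()
    audio = record_command()
    text = cloud_stt(audio)
    speak(handle_on_device(text))
```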

If I’m right, the only thing done in the cloud is STT.
However, I saw the personal backend (GitHub - MycroftAI/personal-backend: WORK IN PROGRESS: A Flask personal backend alternative for running your own version of https://home.mycroft.ai) as well as the goal of your personal server (can’t link that because I’m only allowed to use 2 ref links -.-).
Are these the same thing? And is the goal to get offline TTS on your own server, meaning a user would be able to use Mycroft completely offline? If that’s the point, I’ll get myself a Pi and start porting my skill from Alexa to Mycroft :smiley:
Also, when you say skill handling is done on device, does this mean that every activated skill is downloaded to be executed offline, or is it just the handling, i.e. figuring out which skill to address and then calling that skill in the cloud?

I also don’t quite get the function of Mycroft Home. Is it software that knows which APIs to address depending on the selections the user made in the config (meaning it’s like an online, account-linked config)? So for executing a skill, or calling the selected STT?

Sorry for these (obviously stupid) questions, but the more I read, the more confused I get.
Also, I already used the contact form to ask this, but I’ve just found this forum and think asking here is better.

Hi Mux,

Not stupid questions at all, and you’ve clearly done your reading as it’s mostly as you say.

In terms of TTS, the Mimic1 voices such as British Male (aka Popey) can be synthesised on device, but they sound more robotic. The newer voices use our Mimic2 engine, which is based on Tacotron, and these require more grunt than most people have to synthesise audio in an appropriate time frame. So by default this is also done remotely on one of our servers.
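If you want to force the on-device Mimic1 engine, it’s a small change to your mycroft.conf. A rough sketch of that change is below; the exact keys ("tts", "module", the voice name) are from memory, so double-check them against the documentation (and note that a real config file may contain comments that plain json.loads won’t accept):

```python
import json
from pathlib import Path

# User-level config location on a typical install; adjust if yours differs.
conf_path = Path.home() / ".mycroft" / "mycroft.conf"
config = json.loads(conf_path.read_text()) if conf_path.exists() else {}

# Select the on-device Mimic1 engine and a voice such as "ap" (the British Male voice).
# Key names are my recollection of the documented defaults, so verify them.
config["tts"] = {"module": "mimic", "mimic": {"voice": "ap"}}

conf_path.write_text(json.dumps(config, indent=2))
print("TTS set to local Mimic1; restart the Mycroft services to apply.")
```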

When you install a Skill, that entire Skill is downloaded and installed on your device, so once it receives the transcribed utterance, it can complete everything else on device.
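For a sense of what that looks like, a Skill is just a small Python class that lives on the device. A bare-bones sketch (the intent and dialog file names here are placeholders):

```python
from mycroft import MycroftSkill, intent_file_handler


class HelloWorldSkill(MycroftSkill):
    """Everything below runs on the device once the transcribed text arrives."""

    @intent_file_handler('hello.world.intent')  # placeholder intent file
    def handle_hello_world(self, message):
        # speak_dialog reads a response line from dialog/en-us/hello.world.dialog
        self.speak_dialog('hello.world')


def create_skill():
    # The skills service calls this factory when it loads the installed Skill
    return HelloWorldSkill()
```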

Mycroft Home serves a number of functions, including device management, a GUI for Skill settings, and acting as a proxy for queries to increase users’ privacy. Currently we use Google STT by default; however, to prevent Google from profiling individual users, these requests are all sent from Mycroft. This means Google can’t easily tell if it’s 10 users making 3,000 requests each, or 30,000 users making a single request each. The same is done for search queries like those to WolframAlpha, Wikipedia or DuckDuckGo.
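Purely as an illustration of the idea (this isn’t our actual backend code; the URL and key below are placeholders), the proxy simply re-sends the audio using Mycroft’s own credentials, so the upstream provider never sees which user it came from:

```python
import requests

# Placeholders only; not real endpoints or credentials.
UPSTREAM_STT_URL = "https://stt.example.com/recognize"
MYCROFT_API_KEY = "shared-mycroft-credentials"


def proxy_stt(audio_wav: bytes) -> str:
    """Forward a device's audio upstream with Mycroft's key and no user identity.

    The upstream STT provider only ever sees one anonymous stream of requests
    coming from Mycroft's servers, not from individual users.
    """
    response = requests.post(
        UPSTREAM_STT_URL,
        params={"key": MYCROFT_API_KEY},
        data=audio_wav,
        headers={"Content-Type": "audio/wav"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["transcript"]
```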

The Personal Backend / Personal Server is a Community-driven project to provide a local alternative to Mycroft Home. It’s intended for those who want to run their own services, or those who want even greater customisation.

Also keen to hear more about your thesis. What is your research question at the moment?

Hi gez,

Thanks for your fast reply.

So that means if I configure Mycroft to use Mimic1, the only thing that is not done offline / on device is STT?
Also, when you say Mycroft is used as an interface to Google STT, does that mean you forward the traffic through a proxy server (or a proxy specified in the Mycroft Home settings)?
Anyway, it’s pretty cool to see that there are still very good privacy-friendly solutions out there :slight_smile: However, it’s sad that so many people don’t care about privacy and data security…

One question about your OpenSTT: will it be completely cloud-based too, or are you thinking about bringing it to Mycroft as an on-device solution as well?
Because I thought that to actually process everything offline, you’d need some hardware power to get all the steps done on device reasonably fast (I assume a Raspberry Pi actually has more power than any “conventional” smart speaker/display out there) without leaving the user waiting a long time for a response or action.

But what’s the point of having a local alternative to Mycroft Home? Apart from greater customisation, I don’t really see a use case for the private user. Or is it used more by companies (since, from what I read, you also provide it to them / help them set it up)?

(I hope this isn’t too detailed/boring for you :smiley:)
My thesis is about digital voice assistants on home and mobile devices.

Its goal is to provide an overview of this topic from a technical perspective. It should also support decision-making in application development (which platform to target, where the most users can be reached, what they are used for).
So I’m covering the technical background as well as several platforms (Alexa, Google Assistant, Siri, Bixby, Cortana and Mycroft), some market and user analysis, and at the end I’ll focus on application development.
For the latter, my current goal is to develop an application for Google Assistant and compare it with developing an Alexa skill (which I did in my last thesis, and it was the worst experience ever since there were so many problems with non-working functions Amazon provides). I also want to develop the same application for both platforms using Jovo and compare that to native development.

Since my time is limited, I won’t be able to also include skill development for Mycroft; however, I’m personally a big fan of Mycroft and hope to find some time for it in my free time. Also, as a dual student I have to focus on the most used and popular technology out there for the company (to address the most potential customers).

No worries,

Yes, if you’re performing TTS on device, STT should be the only remaining cloud service in terms of interactions. The device still needs to pair with a backend for Skill settings, and of course Skills will make network calls, e.g. general questions use DuckDuckGo / Wikipedia / WolframAlpha. System services will also make network calls, e.g. to sync the clock with an NTP service.

I’m sure my conversations are a biased sample, but most people I talk to (tech and non-techies) are concerned about privacy. I think a lot of people would want a voice assistant in their home but they think it’s creepy knowing a company is listening to everything they say.

The open source STT we are aiming for is Mozilla’s DeepSpeech. It is most likely to be cloud-based, though Mozilla has done some refinement and testing to try to get it working on device. An RPi has more grunt than, say, a Google Home Mini, but currently a responsive STT service needs a pretty reasonable GPU. I think this will change over time, but expect the first iteration to be cloud-based because of the response time you mentioned.
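If you want to experiment with on-device DeepSpeech yourself, the Python bindings look roughly like this (the file names are whatever release you’ve downloaded from Mozilla, and the API details can shift between versions, so treat this as a sketch):

```python
import wave

import numpy as np
import deepspeech

# Model and scorer files from Mozilla's DeepSpeech releases (names will vary).
model = deepspeech.Model("deepspeech-models.pbmm")
model.enableExternalScorer("deepspeech-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit, mono PCM audio.
with wave.open("utterance.wav", "rb") as wav_file:
    frames = wav_file.readframes(wav_file.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(model.stt(audio))
```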

Some people don’t want their device to communicate with an external server at all, so they’d like a way to pair, manage Skills, etc., all within their local network. Particularly if they’re using it for home automation rather than answering general questions, you could have a system that is entirely restricted to your LAN.
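For example, pointing a device at a personal backend on your own network is basically just a server URL override in mycroft.conf, merged in the same way as the TTS example earlier. The "server" keys, address and port below are illustrative assumptions; check the personal-backend README for the values it actually expects:

```python
# Fragment to merge into mycroft.conf (see the TTS example above for how).
# Address, port and key names are illustrative assumptions, not documented values.
LOCAL_BACKEND = {
    "server": {
        "url": "http://192.168.1.50:5000",  # personal backend on the LAN
        "version": "v1",
    }
}
```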

Sounds like a really interesting project, look forward to reading it when you publish :slight_smile:

Thanks for the answer and sorry for the late reply.

I know what you are talking about, but a lot of people say that privacy is important and the next moment they are using Instagram, Facebook, WhatsApp and every Google service. They don’t manage any permissions on their phones, nor do they think about whether it’s a good idea to download yet another spy app and give it every permission. Saying privacy is important but not acting on it isn’t really caring about privacy, in my view. I mean, lots of people complain about the permanent listening of smart speakers, yet they give the microphone permission to every app. So where is the difference? :smiley: (At least that’s what I’ve experienced so far.)

So for open STT you are essentially working together with Mozilla to improve DeepSpeech by providing the open dataset?

I think having everything in your LAN would be great; maybe it will be possible one day. Still, Mycroft already does a very good job for users’ privacy. Since STT and TTS are forwarded through a proxy, there is no real need for a LAN solution anyway. It’s a shame Mycroft is not that well known among non-technical people (from what I’ve experienced).
It’s the same as with messenger apps: there are very good privacy-friendly options like Signal, but nobody knows about them and everyone uses WhatsApp…

I’m writing it in German, so I don’t know if you’d still be interested. But if so, and once I’ve cleared whether it’s classified as confidential or not, I can send it to you as soon as it’s done.

So true, we need to look at behaviour not just polls.

We do a few things with Mozilla, but the open data set is quite useful, and not just because it’s a lot of data. It has the added benefit of providing more real-life voice samples: spoken loudly across the room, with kitchen pans crashing or kids screaming in the background. This is the sort of voice data we need to handle for mass adoption, rather than just speech spoken directly into a mic at a computer.

Ah, I see. Thanks for the info!