The Mycroft Personal Server - Starting the Conversation


If you want to run STT or TTS in anywhere near real-time, you need a GPU. Even on an i7 you can’t come close to the performance of a GPU for things like this.

At the same time, the underlying technology is changing very rapidly. With DeepSpeech, the training sessions this spring took 2 weeks on 2 multi-GPU machines. After some major rearchitecture this summer, it can now run a training session against even more data on 1 of those machines in 3 days. So… specifying exact hardware is probably not the way to start. I think we should begin building the technology and see what is needed to support it once we have all the basic pieces in place.
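
To put rough numbers on that change, here is a small back-of-the-envelope sketch. It uses only the figures quoted above and assumes "2 weeks" means 14 days; it also ignores that the newer run covered more data, so the real improvement is presumably larger.

```python
# Rough machine-day comparison using only the figures quoted in the post.
# Assumption: "2 weeks" is treated as 14 days.

def machine_days(days: float, machines: int) -> float:
    """Total machine time consumed by one training session."""
    return days * machines

before = machine_days(days=14, machines=2)  # spring: 2 weeks on 2 machines
after = machine_days(days=3, machines=1)    # summer: 3 days on 1 machine

reduction = before / after
print(f"{before:.0f} machine-days -> {after:.0f} machine-days ({reduction:.1f}x less)")
```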


Hello Steve,

I am quite interested in a Mycroft Server. I tend to run many local network services vs using the equivalent cloud services.

A few things I’d like to see in a Mycroft server.

  • Docker/Docker compose as a deployment option
    • I recently moved from a KVM-based environment to Docker and I’m running more services on the same hardware. Being able to deploy/update this server using Docker would be great.
  • Incubate features.
    • Users of the local server could opt in to beta releases/features and do an initial round of testing before that feature is deployed to the public server.
  • Optional Paid Support.
    • While the intent clearly states personal and not for business, a server will be deployed on various network/hardware configurations and there will be a support cost. This could be structured so that paid support is completely optional, with all the information needed to troubleshoot the server available to those who look.
  • Client to Server version support/compatibility.
    • What is the contract for backwards compatibility? Will the server always support the oldest to the newest client?
    • Will the server be able to support clients with varying versions?
  • Ability to update similar clients with a single image.
    • It would be nice for the server to download a single update image per client type. That way, if you have multiple Mark 1s, the server downloads the update image once and the clients update via the server.
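
The last point could look something like the sketch below: the server fetches an image once per client type and serves every later request from its local copy. All names here (`UpdateImageCache`, `fake_fetch`, the client type strings) are hypothetical and only for illustration; nothing like this exists in Mycroft as far as I know, and a real version would also need to track image versions.

```python
from typing import Callable, Dict


class UpdateImageCache:
    """Download each client type's update image at most once, then serve it locally."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # hypothetical upstream download function
        self._images: Dict[str, bytes] = {}  # client type -> cached image
        self.downloads = 0                   # number of upstream downloads performed

    def image_for(self, client_type: str) -> bytes:
        # Only go upstream the first time a given client type asks.
        if client_type not in self._images:
            self._images[client_type] = self._fetch(client_type)
            self.downloads += 1
        return self._images[client_type]


def fake_fetch(client_type: str) -> bytes:
    """Stand-in for the real download; returns dummy bytes."""
    return f"image-for-{client_type}".encode()


cache = UpdateImageCache(fake_fetch)
for _ in range(3):            # three Mark 1 devices ask for their update
    cache.image_for("mark-1")
cache.image_for("picroft")    # one device of a different type
print(cache.downloads)
```

Three devices of one type plus one of another result in two upstream downloads rather than four.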

I agree with all of your points and am looking forward to a local Mycroft server… though it sounds like I’ll be needing a dedicated GPU to run it :slight_smile:


According to Mozilla’s blog post, it should be possible to process a 3s file in 1.5s on a “laptop CPU” with DeepSpeech 0.2:
So no need for a GPU anymore :thinking:
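
For anyone who wants to put a number on that: the usual measure is the real-time factor, processing time divided by audio length, where anything below 1.0 is faster than real time. A minimal sketch using just the two figures above (the function name is my own):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


rtf = real_time_factor(processing_seconds=1.5, audio_seconds=3.0)
print(rtf, rtf < 1.0)
```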


Ah, I hadn’t seen that blog by Reuben, just heard him saying it was faster – that is closer to acceptable! As I said, this is a moving target and it might be possible to operate without one. Some components, like the Mimic2 voice generation, still perform much better on a GPU. Regardless, I think it should be part of this project’s goal to not only allow a user to run a basic Mycroft experience but also run Mycroft as optimally as possible. That would certainly include whatever is necessary to support GPU utilization.


Super glad to hear about the home server. I like to host (almost) everything myself, and letting geeky users host the services themselves is a good point. Happy to read that a GPU is not needed anymore, because many microservers have powerful Xeon processors with sad integrated GPUs.

Besides the obvious benefits you mentioned (connectivity and security), could performance be another benefit, if fast GPUs turn out not to be a requirement? I mean, Mimic2 is quite slow here (on a 300Mb/300Mb fiber connection). Would a home server improve this, since it would be entirely dedicated to my few Mycroft devices?

I would also like to contribute some ideas (and testing when you finally decide to release it):

  • Docker containers: as @sampsonight suggested. The Docker approach is better than git pulling as we do with mycroft-core: no matter what distro you are using or what libraries you have, it will install with YOUR Mycroft image. On the other hand, as Mycroft is not IO-intensive, it doesn’t need many kernel context switches, so containerization will provide full hardware access and ease of deployment.

  • Federation: as a community, we can federate ourselves and… I don’t know, do whatever we want: create a mesh network, provide some CPU power to those with less capable hardware, share something useful. Surely you can think of something useful to do if you can count on a few hundred more servers. Obviously, federation must be optional, and it should be worthwhile both for the users (premium voices and so on, perhaps?) and for you, to justify the time invested in developing it (CPU time and horsepower for secondary tasks like training voices). The federation algorithm should be fair enough to use the federated CPUs only when idle (or under low usage), at a lower priority and niceness, so the home user’s server stays usable.


I am glad to hear about the possibility of running without a GPU. I haven’t checked recently, but I assumed the effect that cryptocurrency mining had on the price of decent graphics cards was still in play, and a quick look says “Yes, it is”. So it is hard to justify the cost of building the hardware platform to host the server if a GPU is mandatory.

As I said earlier if it can serve a purpose then I’ll happily have a go with my DL380.


I don’t want to totally promise “no GPU” – neural networks are REALLY well suited to run on the massively parallel architecture of a GPU. If you don’t know why, here is a little info on it.

However, even a cheapo GPU can be valuable for this. For example, even the cheezy GPU on a Raspberry Pi can run some ML type tasks 3x faster than the CPU.

As for the cost of GPUs, they’ve dropped significantly since the ridiculous spike this spring. Only time will tell where this goes in the long run, but I’m assuming Moore’s law will make the mid- to low- power GPUs cheaper, as the crypto folks really don’t get any advantage running one of them as a miner.


Very valid point @steve.penrod. It does strike me though that there will be an exponential relationship where the number of parallel high-performance CPU cores crosses the point where the real-time advantage of a GPU becomes minimal. Do you have any idea how many parallel threads are running in the DeepSpeech platform?


I tend to use Intel + integrated GPU nowadays on my Linux computers because anything else gives problems at one time or another…

Current devices have a somewhat improved GPU, but I had the impression that when you say GPU you really mean NVIDIA. Is this not the case anymore? From what I read, TensorFlow has a CUDA backend or a CPU backend, and CUDA is NVIDIA-specific.


As far as I see, NVIDIA is the king:
On the other hand, I was under the impression that while NVIDIA cards were far more powerful, ATI had more processors, so why shouldn’t they be better for parallel processing?

(I need to see inside my HP Microserver G8, to see if there is room for a GPU :thinking: )


This is actually one of the reasons I am following mycroft and testing it. Independent infrastructure.
If you look at trends, there is probably the biggest potential for voice control/recognition within the home area in controlling things (car, home automation, etc.). All very private spaces.
You do not want to have someone listening all the time in your private space (no matter what their privacy policy says). Right now there are few alternatives, but with development over time a fully independent setup (e.g. home automation) would be quite a good selling point.
This translates to the business area as well. Think of a voice-controlled meeting room for presentations, etc. Again, a company would want to have control over the microphones in that room.
Hence a server would deliver exactly this. At first for nerds, and as the technology matures, for the mass market.

Yes, I am looking at this very much from the privacy side. However, you could also make technical cases for availability and reliability of internet access.

Features I would love to see:

  • Docker, makes deployment so much easier
  • Should be able to serve a whole site, e.g. multiple devices in a home network
  • Clustered training data in public and private. E.g. I can gather and use training data locally combined with a central repository
  • Exchange training data, I can download an updated version regularly and if I want to, send my local data to the central repository.
  • Regarding use of GPU vs CPU, I would keep it open to both.
  • Central administration of clients via the server would be nice for larger deployments.


Intel dropped the Phi experiment and has started plans to make discrete GPUs themselves. That should tell you how they think it’s going in the CPU/GPU space.

If you have plans to run a multi-user setup, you will almost certainly want a gpu for deepspeech. For mimic2, you definitely will. A GPU doesn’t necessarily mean a $400 nvidia 1070, though. I can run dozens of sentences through deepspeech a minute on a 1030 ($80ish new). Mimic2 using the demo server isn’t speedy with a 1030, but it can work.

It’s more a matter of how much latency you’re comfortable with in your interactions.


I’m just getting started in the Mycroft world, so please pardon my ignorance, but as I understand the way things are currently, everything listed under “Why Would Anyone Need This?” is already being addressed by the status quo. Am I wrong?

To me, the appeal of a separate server would be:

  1. The ability to use a more powerful machine, thereby implementing functionality that would require more horsepower than a RasPi is capable of.
  2. A distributed architecture that allows for “thin(ner) clients” around the home (i.e. Pi Zero W or Particle Photon) that can do audio-only processing for an integrated whole-home system.

I have several ideas about what item 2 might look like. Integration with an affordable, easily implemented, always-on whole-house audio system would be pretty important.


@DonnyBahama, there are essentially three parts to the process of interacting with Mycroft.
First, your speech is converted to text (STT).
Second, the text is analysed and passed to a “Skill” that can handle your request. The skill returns a text response.
Third, the response text is fed into a Text to Speech converter (TTS) and played to you.

The second and third parts are conducted on your local machine (PC, RPi, or Mark 1). However, the first part requires a neural network that is very processor-intensive, beyond what household computers can handle if you want a near real-time experience (as in, the answer doesn’t take 15 minutes to come back). At the moment the heavy work is done in the cloud, using dedicated, high-performance enterprise servers. However, this does mean your spoken words are sent over the internet. If this bothers you, then the object of this exercise is to bring that server power local to you and keep all your interactions with Mycroft on your private network.
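
The three parts can be sketched as a simple chain of functions. This is only an illustration of the flow described above, not how mycroft-core is actually written: every function name here is made up, the STT and TTS steps are trivial placeholders that pass text through as bytes, and the “skill” matching is a plain keyword lookup.

```python
from typing import Callable, Dict


def speech_to_text(audio: bytes) -> str:
    """Part 1 (STT) placeholder: pretends the audio bytes are already UTF-8 text."""
    return audio.decode("utf-8")


def time_skill(_utterance: str) -> str:
    """A toy skill that always gives the same answer."""
    return "It is noon."


# Hypothetical registry: keyword -> skill handler.
SKILLS: Dict[str, Callable[[str], str]] = {"time": time_skill}


def handle_with_skill(utterance: str) -> str:
    """Part 2: pick a skill that can handle the text and return its text response."""
    lowered = utterance.lower()
    for keyword, skill in SKILLS.items():
        if keyword in lowered:
            return skill(utterance)
    return "Sorry, I did not understand that."


def text_to_speech(text: str) -> bytes:
    """Part 3 (TTS) placeholder: returns the reply as bytes instead of real audio."""
    return text.encode("utf-8")


def interact(audio: bytes) -> bytes:
    """Run the three parts in order: STT, skill handling, TTS."""
    text = speech_to_text(audio)
    reply = handle_with_skill(text)
    return text_to_speech(reply)


print(interact(b"What time is it?"))
```

In this picture, a local server would take over `speech_to_text`, the step that currently runs in the cloud.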


Thanks for the clarification. I wasn’t aware that the first part wasn’t done locally. Now that I know, I’m doubly supportive of a local server!


This is no different to Alexa. However, in that case it is Amazon who has your speech. Be aware that only the speech recorded following you saying “Hey Mycroft” is transmitted. It’s not sending everything it hears. Can we say the same for Alexa? Who knows!


Hi @steve.penrod, so what would you be thinking about, along the lines of a hardware platform for a server? Based on what I am reading here, I’m getting a feel for a reasonable machine with a half-decent GPU, perhaps not a $1000 beast but something reasonable. Any recommendations on the graphics card type?

Then for the OS, Linux server of some description?

I have PCI slots in the Proliant so adding a GPU isn’t out of the question.

Many thanks,


Check if your server supports power lines to the pci-e slots. Most don’t by default. If that’s the case then you’re stuck with an nvidia 1030 or similar. If it does, then you can probably get up to a 1050ti… beyond that requires additional power connectors that you’d have to hack into place. Doing that, you could get a 1070, which is a superb card for most of these things.

I have a desktop (i7 4770) with two 1030’s in it that handles deepspeech well, and mimic2 somewhat slowly.


It is WAY too early in the game to make any specific hardware recommendations. We need to keep an eye on price, but given that there is an option that is essentially free (just using Home) I don’t feel like that needs to be the primary decision maker.

Given the history of computing, if we push out a stack of software that runs on a system that costs $X next spring, then we can fairly expect it will be able to run on a system that only costs half that within 18 months. And remember, the software itself is also progressing very rapidly. The DeepSpeech memory requirement dropped from something like 6GB to a few hundred MB, for example. This was exceptionally huge, but I’m certain there will be other significant software optimizations.
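
As a rough illustration of that rule of thumb, the sketch below halves the hardware cost every 18 months. The $1000 starting price is purely a made-up example, and treating the halving as a smooth curve is an assumption; it does not account for the software getting cheaper to run as well.

```python
def projected_cost(initial_cost: float, months: float, halving_months: float = 18.0) -> float:
    """Cost of equivalent hardware after `months`, assuming it halves every `halving_months`."""
    return initial_cost * 0.5 ** (months / halving_months)


# Hypothetical $1000 system at launch, checked at 0, 18 and 36 months.
for months in (0, 18, 36):
    print(months, projected_cost(1000.0, months))
```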

This is also part of the “does it have to be NVIDIA” discussion. Today TensorFlow only works with some GPUs, but I’m certain that is going to change over the next 6 months. A year ago there was no way to run TensorFlow on a Raspberry Pi, but now we have TensorFlow Lite. Is it exactly as capable? No. But many things can be done with it with minor modifications.


Hi Steve, valid points and I hear what you are saying. I think it’ll be a case of, when you are ready, the team of testers will see what we have to hand that will suit as a test platform. I think what you can draw from this is that there is keen interest in this development.

Of course another argument would be that if, for some reason, your cloud processors were taken offline, then we all have dead Mycrofts. :open_mouth::disappointed_relieved: Perish the thought! However, something is funding their runtime and I’m sure it’s not cheap.

Shout when you are ready. :grinning: