The Mycroft Personal Server - Starting the Conversation

Originally published at: http://mycroft.ai/blog/mycroft-personal-server-conversation/

In my July post where I introduced the Mycroft Roadmaps, I laid out plans for a Mycroft Personal Server. I’ve had conversations with many about the concept, but exact goals and designs haven’t been established yet. I see this as a highly Community-centric project, so I’d like to start a conversation so we can all get on the same page.

What is it?

Mycroft is inherently modular, allowing pieces of the system to be moved around as appropriate for the specific implementation. Up to this point, the typical implementation runs the majority of Mycroft on a device such as a Mark 1, Raspberry Pi, or laptop. This includes Wake Word processing, intent parsing, and Text to Speech (TTS) (more on Mimic2 TTS below).

For normal operation, there is one critical piece that isn’t included in that list – Speech to Text (STT). The typical Mycroft device today uses Mycroft’s Home to perform the STT operation. This is automatic and invisible to most users.

Beyond STT, the server also offers several other services that are important.

In my view, the Personal Server would provide some version of all of these services. It should allow a household to run all of their Mycroft equipment without any internet connection, except when a Skill needs to go online to retrieve information.

This means the Personal Server would, at a minimum, need to run Speech to Text (DeepSpeech) and Text to Speech (Mimic), and provide a configuration web interface.
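To make that a little more concrete – and this is only a rough sketch, not a final design; the module names and server address below are placeholders I’m using for illustration – a device might be pointed at a local server with a small configuration override, something like:

```python
import json
from pathlib import Path

# Hypothetical user-level override pointing a device's STT at a local
# Personal Server while keeping TTS on local Mimic. The module names and
# the server URI are illustrative placeholders, not a final design.
override = {
    "stt": {
        "module": "deepspeech_server",
        "deepspeech_server": {"uri": "http://192.168.1.10:8080/stt"},
    },
    "tts": {
        "module": "mimic",
    },
}

conf_path = Path.home() / ".mycroft" / "mycroft.conf"
conf_path.parent.mkdir(parents=True, exist_ok=True)
conf_path.write_text(json.dumps(override, indent=2))
print("Wrote local-server override to", conf_path)
```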

Why would anyone need this?

There are several very good reasons for implementing this capability.
  • Slow, unreliable internet - I’m personally spoiled by Google Fiber here in Kansas City and forget that not everyone in the world has gigabit connection speeds.
  • Limited or expensive internet - Similar to the above, but with a slightly different motivation.
  • No internet - Yes, this exists. Imagine locations in the mountains, on boats, or the far side of the moon.
  • Privacy concerns - Every time data leaves your control, there is a possibility of others misusing it, not safeguarding it adequately, or it being intercepted.
For those willing to accept the responsibility, keeping all operations within a home provides the ultimate in reliability and security.

What a Personal Server Isn’t

The Personal Server is intended to be Personal – not Enterprise Grade. The main reason for this is simplicity. For example, if you don’t have to handle Speech to Text requests for thousands of users, the odds of two requests arriving at once are very low. That means STT requests can be run sequentially instead of requiring a bank of STT servers that can handle a high load.
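As a toy illustration of the difference (just a sketch, not an implementation plan), a household-scale server can get away with a single worker pulling transcription jobs off a queue one at a time, where an enterprise service would need a whole pool of workers:

```python
import queue
import threading
import time

# Toy sketch: one worker handles STT jobs sequentially. With only a
# handful of household devices, requests rarely overlap, so a single
# worker is usually enough.
jobs = queue.Queue()

def fake_stt(utterance_id):
    """Stand-in for a real STT engine call (e.g. DeepSpeech)."""
    time.sleep(0.5)  # pretend transcription takes half a second
    return "transcript of " + utterance_id

def worker():
    while True:
        utterance_id = jobs.get()
        if utterance_id is None:   # shutdown sentinel
            break
        print(fake_stt(utterance_id))
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

for utterance in ["kitchen-001", "living-room-001"]:
    jobs.put(utterance)

jobs.join()     # wait for the queue to drain
jobs.put(None)  # stop the worker
```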

A Personal Server also isn’t for everyone. You don’t have to be a High Geek, but it will require some significant computational resources, like an always-on PC with a high-quality GPU.

Does this mean Home is no longer needed?

No, for several reasons.

First, many people will still want the convenience of just plugging in their device and running it. No worries about setting up a server, no challenges accessing the web UI from their phone without firewall magic, etc. It just works.

Second, there is still value in having a central collaboration hub. Mycroft has always been about communal efforts, and Community requires gathering places. Home provides a place to:

  • Share and assist in tagging data to advance the technology
  • Discover new information
  • Download voices and Skills from others
  • Provide a gateway to access other Mycroft devices and Mycroft-hosted services

Your Thoughts?

All of the above are my thoughts. But as I said at the beginning, I want this to be a conversation. What do you want and see for the Mycroft Personal Server? Are there concerns I’m overlooking? Would you like to be involved in the building of this, taking control of your own fate?

Please let us know your thoughts below!

16 Likes

Hi Steve,

This sounds very interesting. Agreed, Enterprise-grade machines are not for the home; however, as they say, there is always one. Me! I happen to have a reasonably pokey HP ProLiant DL380 Generation 7. It is sporting dual hex-core Xeon processors, hyperthreaded, hence 24 cores, clocking at 2.93 GHz, and 32 GB of RAM. It also has about 500 GB of disk space on a low-latency RAID 5 array. However, it lacks the GPU part.

It is currently uncommitted and has a straight install of Ubuntu 16.04 server (no hypervisor).

The question is, is this platform worthy of a trial?

I am reasoning that when a packet of work comes in from a Mycroft front end, the power is drawn, after which it backs off. On that basis, setting the machine to run distributed computing tasks, such as Folding@home, makes use of the otherwise wasted idle energy.

That is the downside: it is thirsty and chews 300 watts regardless. Plus, if you go to YouTube and find a video of a B-52 bomber idling – yep, that’s what it sounds like.

Out of interest, what does your cloud based platform spec look like?

Many thanks,

Dave

If you want to run STT or TTS in anywhere near real-time, you need a GPU. Even on an i7 you can’t come close to the performance of a GPU for things like this.

At the same time, the underlying technology is changing very rapidly too. With DeepSpeech, the training sessions this spring took 2 weeks on 2 multi-GPU machines. After some major rearchitecture this summer, it can now run a training session against even more data on 1 of those machines in 3 days. So… speccing exact hardware is probably not the way to start. I think we should begin building the technology and see what is needed to support it once we have all the basic pieces in place.

4 Likes

Hello Steve,

I am quite interested in a Mycroft Server. I tend to run many local network services vs using the equivalent cloud services.

A few things I’d like to see in a Mycroft server.

  • Docker/Docker compose as a deployment option
    • I recently moved from a KVM-based environment to Docker and I’m running more services on the same hardware. Being able to deploy/update this server using Docker would be great.
  • Incubate features.
    • Users of the local server could opt in to beta releases/features and do an initial round of testing before that feature is deployed to the public server.
  • Optional Paid Support.
    • While the intent clearly states personal and not for business, a server will be deployed on various network/hardware configurations and there will be a support cost. This could be structured so that all the information needed to troubleshoot the server is available to those who look, and paid support remains completely optional.
  • Client to Server version support/compatibility.
    • What is the contract for backwards compatibility? Will the server always support everything from the oldest to the newest client?
    • Will the server be able to support clients with varying versions?
  • Ability to update similar clients with a single image.
    • It would be nice for the server to download a single update image per client type. That way, if you have multiple Mark 1s, the server downloads the update/image once and the clients update via the server.

I agree with all of your points and am looking forward to a local Mycroft server… though it sounds like I’ll be needing a dedicated GPU to run it :slight_smile:

1 Like

Hi,
according to Mozilla’s blog post, it should be possible to process a 3 s audio file in 1.5 s on a “laptop CPU” with DeepSpeech 0.2:
https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/
So no need for a GPU anymore :thinking:
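For anyone who wants to try it, transcribing a WAV file with the 0.2 Python package looks roughly like this (the constants are the values used in Mozilla’s own example client, the file paths are placeholders, and later DeepSpeech releases changed this API):

```python
import scipy.io.wavfile as wav
from deepspeech import Model

# Values from Mozilla's DeepSpeech 0.2 example client; the Model
# constructor signature is specific to the 0.x releases of that era.
N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 500

ds = Model("models/output_graph.pb", N_FEATURES, N_CONTEXT,
           "models/alphabet.txt", BEAM_WIDTH)

fs, audio = wav.read("test.wav")  # expects 16 kHz, 16-bit mono PCM
print(ds.stt(audio, fs))
```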

1 Like

Ah, I hadn’t seen that blog by Reuben, just heard him say it was faster – that is closer to acceptable! As I said, this is a moving target and it might be possible to operate without one. Some components, like Mimic2 voice generation, still perform much better on a GPU. Regardless, I think it should be part of this project’s goals to not only allow a user to run a basic Mycroft experience, but also to run Mycroft as optimally as possible. That would certainly include whatever is necessary to support GPU utilization.

2 Likes

Super glad to hear about the home server. I like to host (almost) everything myself, and letting geek users host the services themselves could be a good selling point. Happy to read a GPU is not needed anymore, because many microservers have powerful Xeon processors with sad integrated GPUs.

Besides the obvious benefits you mentioned before (connectivity and security), could performance be another benefit, if fast GPUs really aren’t needed as a requirement? I mean, Mimic2 is quite slow here (on a 300Mb/300Mb fiber connection). Would a home server improve this, since it would be entirely dedicated to my few Mycroft devices?

I would also like to contribute some ideas (and testing, when you finally decide to release it):

  • Docker containers: as @sampsonight suggested. The Docker approach is better than git pulling as we do with mycroft-core: no matter what distro you are using or what libraries you have, it will install with YOUR Mycroft image. On the other hand, as Mycroft is not IO-intensive, it doesn’t need many kernel context switches, so containerization can still provide full hardware access and makes deployment easy.

  • Federation: as a community, we can federate ourselves and… I don’t know, do whatever we want: create a mesh network, provide some CPU power to those with less capable hardware, share something useful. Surely you can think of something useful if you can count on a few hundred more servers. Obviously, federation must be optional, and it should be worthwhile both for the users (premium voices and such, perhaps?) and for you, to justify investing time in developing it (CPU time and horsepower for secondary tasks like training voices or the like). The federation algorithm should be fair enough to use the federated CPUs only when they are idle (or under low usage), and to run at lower priority and higher niceness so the home user’s server remains usable.

3 Likes

I am glad to hear about the possibility of running without the GPU. Assuming (as I haven’t checked recently) that the effect cryptocurrency mining had on the price of a decent graphics card is still in play – and a quick look would say “yes it is” – then it is really hard to justify the cost of building the hardware platform to host the server if a GPU is mandatory.

As I said earlier, if it can serve a purpose then I’ll happily have a go with my DL380.

I don’t want to totally promise “no GPU” – neural networks are REALLY well suited to run on the massively parallel architecture of a GPU. If you don’t know why, here is a little info on it.

However, even a cheapo GPU can be valuable for this. For example, even the cheesy GPU on a Raspberry Pi can run some ML-type tasks 3x faster than the CPU.

As for the cost of GPUs, they’ve dropped significantly since the ridiculous spike this spring. Only time will tell where this goes in the long run, but I’m assuming Moore’s law will make the mid- to low-power GPUs cheaper, as the crypto folks really don’t get any advantage running one of them as a miner.

4 Likes

Very valid point, @steve.penrod. It does strike me, though, that there will be a crossover point where the number of parallel high-performance CPU cores is large enough that the real-time advantage of a GPU becomes minimal. Do you have any idea how many parallel threads are running in the DeepSpeech platform?

I tend to use Intel with integrated graphics nowadays on my Linux computers, because anything else gives problems at one time or another…

Current devices have a somewhat improved GPU, but I had the impression that when you say GPU you really mean NVIDIA. Is this not the case anymore? From what I read, TensorFlow has a CUDA backend or a CPU backend, and CUDA is NVIDIA-specific.

As far as I see, NVIDIA is the king: https://www.tensorflow.org/performance/benchmarks
On the other hand, I was under the impression that while NVIDIA cards were far more powerful, ATI had more processors – so why shouldn’t they be better for parallel processing?

(I need to see inside my HP Microserver G8, to see if there is room for a GPU :thinking: )

This is actually one of the reasons I am following mycroft and testing it. Independent infrastructure.
If you look at trends, the biggest potential for voice control/recognition is probably in the home, in controlling things (car, home automation, etc.) – all very private spaces.
You do not want to have someone listening all the time in your private space (no matter what their privacy policy says). Right now there are few alternatives, but as the technology develops, a fully independent setup – home automation, for example – would be quite a good selling point.
This translates to the business arena as well. Think of a voice-controlled meeting room for presentations, etc. Again, a company would want control over the microphones in that room.
Hence a server would deliver exactly this – at first for nerds, and as the technology matures, for the mass market.

Yes, I am looking at this very much from the privacy side. However, you could also make technical cases around the availability and reliability of internet access.

Features I would love to see:

  • Docker – makes deployment so much easier
  • Should be able to serve a whole site, e.g. multiple devices on a home network
  • Clustered training data, both public and private – e.g. I can gather and use training data locally, combined with a central repository
  • Exchange of training data: I can download an updated version regularly and, if I want to, send my local data to the central repository.
  • Regarding use of GPU vs CPU, I would keep it open to both.
  • Central administration of clients via the server would be nice for larger deployments.
4 Likes

Intel dropped the Phi experiment and has started plans to make discrete GPUs themselves. That should tell you how they think it’s going in the CPU/GPU space.

If you have plans to run a multi-user setup, you will almost certainly want a GPU for DeepSpeech. For Mimic2, you definitely will. A GPU doesn’t necessarily mean a $400 NVIDIA 1070, though. I can run dozens of sentences through DeepSpeech a minute on a 1030 ($80ish new). Mimic2 using the demo server isn’t speedy with a 1030, but it can work.

It’s more a matter of how much latency you’re comfortable with in your interactions.

I’m just getting started in the Mycroft world, so please pardon my ignorance, but as I understand the way things currently are, everything listed under “Why would anyone need this?” is already being addressed by the status quo. Am I wrong?

To me, the appeal of a separate server would be:

  1. The ability to use a more powerful machine, thereby implementing functionality that would require more horsepower than a RasPi is capable of.
  2. A distributed architecture that allows for “thin(ner) clients” around the home (i.e. Pi Zero W or Particle Photon) that can do audio-only processing for an integrated whole-home system.

I have several ideas about what item 2 might look like. Integration with an affordable, easily implemented, always-on whole-house audio system would be pretty important.

@DonnyBahama There are essentially three parts to the process of interacting with Mycroft.
First, your speech is converted to text (STT).
Second, the text is analysed and routed to a “Skill” that can handle your request. The Skill returns a text response.
Third, the response text is fed into a Text to Speech converter (TTS) and played back to you.

The second and third parts are conducted on your local machine (PC, RPi, or Mark 1); however, the first part requires a neural network that is very processor intensive – beyond that of household computers if you want a near real-time experience (as in, the answer doesn’t take 15 minutes to come back). At the moment the heavy work is done in the cloud, using dedicated, high-performance Enterprise servers. However, this does mean your spoken words are sent over the internet. If this bothers you, then the object of this exercise is to bring that server power local to you and keep all your interactions with Mycroft on your private network.
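Very roughly, the round trip looks like the sketch below – placeholder functions only, not Mycroft’s actual code, just the shape of the three stages:

```python
# Toy sketch of the three stages with placeholder functions -- not
# Mycroft's actual code, just the shape of the round trip.

def speech_to_text(audio: bytes) -> str:
    """Stage 1: heavy neural-network STT (today done in the cloud)."""
    return "what is the weather"

def handle_intent(utterance: str) -> str:
    """Stage 2: match the text to a Skill and get a text reply (local)."""
    if "weather" in utterance:
        return "It is sunny and 22 degrees."
    return "Sorry, I didn't catch that."

def text_to_speech(reply: str) -> bytes:
    """Stage 3: synthesize the reply with Mimic (local)."""
    return reply.encode("utf-8")  # stand-in for real audio output

audio_in = b"...microphone capture..."
print(text_to_speech(handle_intent(speech_to_text(audio_in))))
```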

2 Likes

Thanks for the clarification. I wasn’t aware that the first part wasn’t done locally. Now that I know, I’m doubly supportive of a local server!

1 Like

This is no different to Alexa; however, in that case it is Amazon who has your speech. Be aware that only the speech recorded after you say “Hey Mycroft” is transmitted. It’s not sending everything it hears. Can we say the same for Alexa? Who knows!

Hi @steve.penrod, so what are you thinking about along the lines of a hardware platform for a server? Based on what I am reading here, I’m getting a feel for a reasonable machine with a half-decent GPU – perhaps not a $1000 beast, but something reasonable. Any recommendations on the graphics card type?

Then for the OS, Linux server of some description?

I have PCI slots in the Proliant so adding a GPU isn’t out of the question.

Many thanks,
Dave

Check if your server supplies power to the PCIe slots. Most don’t by default. If that’s the case, then you’re stuck with an NVIDIA 1030 or similar. If it does, then you can probably get up to a 1050 Ti… beyond that requires additional power connectors that you’d have to hack into place. Doing that, you could get a 1070, which is a superb card for most of these things.

I have a desktop (i7-4770) with two 1030s in it that handles DeepSpeech well, and Mimic2 somewhat slowly.