Mycroft on Nvidia Jetson Nano


#1

Hi,

I was wondering if Mycroft can run on an Nvidia Jetson Nano deployed with CUDA. With 128 CUDA cores and 4 GB of RAM running a Linux4Tegra OS, such hardware has the potential of making Mycroft run pretty fast.

What I don’t know is whether there is any easy way to make the Mycroft software use those 128 CUDA cores.

Has the Mycroft team reviewed this option when choosing the SBC for the Mark II and III? And if so, why did it choose another way? Just curious.

/Daniel


#2

Hi Daniel,

There’s been plenty of talk about it, but I haven’t seen anyone give it a go unfortunately. I’m particularly interested to hear how well it handles running a DeepSpeech STT service.

It would require some work to make use of all the available system resources.

For the Mark II this board would unfortunately be too expensive, as there is a range of other costs to consider, particularly a mic array, for which we’re using the Seeed Mic Array v2. The Mark III however…


#3

I saw Adafruit is working on a similar board: https://blog.adafruit.com/2019/09/02/machine-learning-monday-braincraft-hat-for-raspberry-pi-and-single-board-linux-computers-adafruit-raspberry_pi-tensorflow-machinelearning-tinyml-raspberrypi/ How well could something like this work?


#4

The Nano is a pretty beefy SBC with a small GPU on board, and lots of support. Even then, one community member has tried a bunch of tools on it and hasn’t had great luck with it working well. This doesn’t look as capable as the Nano, so probably not very well. Might be useful for wake word spotting?


#5

Bummer. Well, it does seem like there are a fair number of people looking at it, so hopefully there will be a chipset which does the work we want at lower power than a general-purpose device. I’ve never understood whether the Matrix chips could help either, but I’m assuming people are looking at systems like those too.

Thanks Baconater!


#6

I have a Jetson Nano and can confirm that Mycroft runs on it, but Mycroft does not utilize the GPU/CUDA cores out of the box. The CPU is an ARM Cortex-A57, which is better/faster than the Cortex-A53 of the RPi 3. Overall the performance of Mycroft running on the Jetson Nano is a bit better, and the 4 GB of RAM helps a bit performance-wise as well.
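
If you want to verify that the GPU is at least visible to a deep learning framework before pointing any STT/TTS backend at it, a quick sanity check looks something like this (a minimal sketch, assuming you’ve installed NVIDIA’s TensorFlow wheel for JetPack; stock Mycroft stays on the CPU regardless):

```python
# Sanity check: is the Nano's GPU visible to TensorFlow at all?
# (TF 1.x API, as shipped in NVIDIA's JetPack wheels around this time.)
import tensorflow as tf

print(tf.test.is_gpu_available())   # True if a CUDA device was found
print(tf.test.gpu_device_name())    # e.g. "/device:GPU:0"
```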

One pain point is that the Jetson Nano is aarch64/arm64 architecture, where you sometimes have trouble finding pre-built software packages (apt and PyPI). So expect to be building packages yourself from scratch, or not being able to use some pieces of software at all.

Right now I am testing different STT and TTS systems on the Nano.

For STT I got Kaldi, Zamia (based on Kaldi) and Mozilla DeepSpeech running. While performance is OK for most of them, the quality of the pre-trained models provided is still bad, e.g. a phrase like “turn off the light” sometimes needs five attempts until the system gets it right.
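
In case it helps anyone reproduce the DeepSpeech test, the inference side is only a few lines. A minimal sketch against the DeepSpeech 0.7.x Python API; the model, scorer and WAV file names are placeholders for whatever you downloaded:

```python
# Minimal DeepSpeech STT inference sketch (DeepSpeech 0.7.x Python API).
# File names are placeholders -- substitute your own model and recording.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.7.4-models.pbmm")                  # acoustic model
ds.enableExternalScorer("deepspeech-0.7.4-models.scorer")   # optional LM scorer

# DeepSpeech expects 16 kHz, 16-bit, mono PCM audio.
with wave.open("turn_off_the_light.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # hopefully "turn off the light" on the first attempt
```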

On the TTS side I got Tacotron, Tacotron2 and WaveRNN in different flavors running. Again the pre-trained models differ in quality, and there is a trade-off between quality and performance, e.g. the very good sounding WaveRNN models require more than 10 seconds of processing for one second of audio.
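
That trade-off is easiest to compare as a real-time factor (processing time divided by audio duration; the WaveRNN figure above is an RTF over 10). A minimal sketch of how to measure it, where `synthesize` stands in for whichever entry point your Tacotron/WaveRNN fork actually exposes:

```python
# Hedged sketch: measure the real-time factor (RTF) of a TTS call.
# `synthesize` and the sample rate are assumptions -- adapt to your fork.
import time

def real_time_factor(synthesize, text, sample_rate=22050):
    start = time.perf_counter()
    wav = synthesize(text)             # assumed to return a 1-D array of samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / sample_rate
    return elapsed / audio_seconds     # RTF > 10 means 10 s compute per 1 s audio
```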


#7

Hi Dominik, do you have GitHub code for an implementation of Tacotron on the Jetson Nano? I am new to the Jetson Nano and new to creating my own hacks in general, so any help would be appreciated :slight_smile: !


#8

I tried several implementations:

The reason for choosing these was the availability of pre-trained models for inference, as training such models on the Jetson Nano is not feasible (only 4 GB RAM, and the GPU is too limited for this purpose).

For performance etc., see also my posts in the Mycroft Chat channel ~machine-learning.


#9

Has anyone tried running the trained TensorFlow .pb through TensorRT? You can run it on the Nano or on a PC with an NVIDIA GPU. It should optimize the model for faster inference, better memory utilization, and ensure it uses all the goodness on the NVIDIA hardware.

https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html

After getting it to work stock, you can speed it up further.
Depending on the level of precision the model was trained at, you can also try dropping it to FP16 or INT8. That could dramatically reduce the model size and memory required and speed up inference, while you may only see a nominal decrease in accuracy. If I get some time in the coming weeks, I’ll see if I can test it out myself.
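
For reference, the conversion itself is fairly short. A hedged sketch following the TF-TRT user guide linked above, using the TF 1.14-era `TrtGraphConverter` API; the graph file name and output node name are placeholders, not DeepSpeech’s actual ones:

```python
# Hedged sketch: optimize a frozen TF 1.x .pb graph with TF-TRT.
# "frozen_model.pb" and "logits" are placeholder names, not real ones.
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Load the frozen GraphDef from disk.
with tf.io.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

converter = trt.TrtGraphConverter(
    input_graph_def=graph_def,
    nodes_blacklist=["logits"],   # output node(s), left untouched by TRT
    precision_mode="FP16",        # or "INT8", which needs a calibration pass
    max_batch_size=1,
)
trt_graph = converter.convert()   # returns the TRT-optimized frozen GraphDef

with tf.io.gfile.GFile("frozen_model_trt.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())
```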