There’s been plenty of talk about it but I haven’t seen anyone give it a go unfortunately. I’m particularly interested to hear how well they handle running a DeepSpeech STT service.
It would require some work to make use of all the available system resources.
For the Mark II this board would unfortunately be too expensive as there are a range of other costs to consider, particularly a Mic Array for which we’re using the Seeed Mic Array v2. The Mark III however…
The nano is a pretty beefy SBC with a small GPU on board, and lots of support. Even then, one community member has tried a bunch of tools on it and hasn’t had great luck with it working well. This doesn’t look as capable as the nano, so probably not very well. Might be useful for wakeword spotting?
Bummer. Well, it does seem like there are a fair number of people looking at it, so hopefully there will be a chipset which does the work we want at lower power than a general-purpose device. I’ve never understood what the matrix chips were or whether they could help either, but I’m assuming people are looking at systems like those too.
I have a Jetson Nano and can confirm that Mycroft runs on it, but Mycroft does not utilize the GPU/CUDA cores out of the box. The CPU is an ARM Cortex-A57, which is better/faster than the Cortex-A53 of the RPi 3. Overall the performance of Mycroft running on the Jetson Nano is a bit better, and the 4GB of RAM helps a bit performance-wise as well.
One pain point is that the Jetson Nano is an aarch64/arm64 architecture, where you sometimes have trouble finding pre-built software packages (apt and PyPI). So expect to be building packages yourself from scratch, or not being able to use some pieces of software at all.
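If you want a setup script to warn about this up front, a minimal sketch could look like the following. The helper name `is_arm64` and the set of machine strings are my own assumptions for illustration, not anything from Mycroft.

```python
import platform

# Machine strings commonly reported for 64-bit ARM.
# Treat this set as an assumption; other systems may report other names.
ARM64_NAMES = {"aarch64", "arm64"}


def is_arm64(machine=None):
    """Return True if the machine string names a 64-bit ARM platform.

    When no string is given, ask the running interpreter via platform.machine().
    """
    name = (machine or platform.machine()).lower()
    return name in ARM64_NAMES


if __name__ == "__main__":
    if is_arm64():
        print("64-bit ARM detected: some packages may need building from source.")
    else:
        print("Machine type:", platform.machine())
```

On an arm64 board, pip falls back to building from source whenever PyPI has no matching wheel, which is where most of the extra build time comes from.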
Right now I am testing different STT and TTS systems on the Nano.
For STT I got Kaldi, Zamia (based on Kaldi) and Mozilla DeepSpeech running. While performance is o.k. for most of them, the quality of the pre-trained models provided is still bad, e.g. a phrase like “turn off the light” sometimes needs five attempts until the system gets it right.
On the TTS side I got Tacotron, Tacotron2 and WaveRNN in different flavors running. Again the pre-trained models differ in quality and there is the trade-off between quality and performance, e.g. the very good sounding WaveRNN models require more than 10 seconds of processing for one second of audio.
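To put numbers like that on a common scale, it can help to express them as a real-time factor (processing time divided by audio duration). This is just a sketch of the arithmetic; the function name is my own, and the 5-second sentence is a made-up example.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration.

    Values below 1.0 mean the system is faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# The WaveRNN case above: more than 10 s of processing per 1 s of audio.
rtf = real_time_factor(10.0, 1.0)

# Hypothetical 5-second sentence at that factor.
wait_seconds = rtf * 5.0
print(f"RTF {rtf:.1f}: a 5 s sentence takes about {wait_seconds:.0f} s")
```

For a voice assistant you generally want the factor well under 1.0, otherwise the user waits longer than the reply itself lasts.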
Has anyone tried running the trained TensorFlow .pb through TensorRT? You can run it on the Nano or on a PC with an NVIDIA GPU. It should optimize the model for faster inference performance, better memory utilization, and ensure it uses all the goodness on the NVIDIA hardware.
After getting it to work stock, you can speed it up further.
Depending on the level of precision the model was trained at, you can also try dropping it to fp16 or int8. That could dramatically reduce the model size and memory required and speed up inference, while only causing a nominal decrease in accuracy. If I get some time in the coming weeks, I’ll see if I can test it out myself.
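As a rough illustration of the size/precision trade-off only (this is not TensorRT, and the random weights are made up), here is a sketch that stores the same values as fp32 and fp16 using Python's `struct` module and measures the rounding error:

```python
import random
import struct


def pack_weights(weights, fmt):
    """Pack floats little-endian with struct format 'f' (fp32) or 'e' (fp16)."""
    return struct.pack(f"<{len(weights)}{fmt}", *weights)


def unpack_weights(blob, fmt):
    """Inverse of pack_weights for the same format character."""
    count = len(blob) // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{count}{fmt}", blob))


random.seed(0)
# Stand-in for model weights: 1000 values in [-1, 1] (purely illustrative).
weights = [random.uniform(-1.0, 1.0) for _ in range(1000)]

fp32 = pack_weights(weights, "f")
fp16 = pack_weights(weights, "e")

# Round-trip through fp16 to see how much precision is lost.
restored = unpack_weights(fp16, "e")
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("fp32 bytes:", len(fp32))
print("fp16 bytes:", len(fp16))
print("max rounding error:", max_error)
```

Storage halves going from fp32 to fp16. Whether the accuracy loss stays nominal for a real STT or TTS model is something you would have to measure on that model.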
Hi Dominic. I noticed your new release of DeepSpeech here. Is the quality still lacking or has it improved since last September?
(I’m asking since I’m curious whether to buy a Nano to replace Alexa or not :)
DeepSpeech 0.8.x as a software stack has seen a lot of improvements.
The STT “quality” is still dependent on a number of factors like input signal quality (you can run a denoiser like RNNoise before passing the audio to DeepSpeech) and the language you want to use. The English pre-trained models are now at a WER (word error rate) of less than 9% (actually 6-7% if I remember correctly). For other languages the situation is different, e.g. the best German model I know of still has a WER of 15%, which in my opinion is not feasible for the Mycroft use-case scenario.
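For anyone unfamiliar with the metric: WER is the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognized text, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and case-insensitive matching (real evaluations usually normalize text more carefully):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One wrong word out of four reference words.
print(word_error_rate("turn off the light", "turn of the light"))  # 0.25
```

Short commands are unforgiving here: a single wrong word in a four-word phrase is already a 25% WER for that utterance, and may be enough to break intent matching.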
Note: I have customized mimic2_tts.py (replacing the line req_route = self.url + "/synthesize?text=" + sentence with req_route = self.url + sentence). In the mycroft-core dev branch there is also a new module, mozilla_tts.py.
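To show what that one-line change does to the request URL, here is a small standalone sketch. The two functions just mirror the string concatenation from the note; the host names, ports and the "/api/tts?text=" path are hypothetical placeholders, so use whatever your own TTS server expects.

```python
def original_route(url, sentence):
    """Route as built by the unmodified line: a fixed path is appended."""
    return url + "/synthesize?text=" + sentence


def customized_route(url, sentence):
    """Route after the change: the configured URL carries the full path."""
    return url + sentence


# Hypothetical server addresses, for illustration only.
print(original_route("http://localhost:3000", "hello"))
print(customized_route("http://localhost:5002/api/tts?text=", "hello"))
```

The point of the change is that the endpoint path moves out of the code and into the configured URL, so the same module can talk to a server with a different route.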