I wanted to check in on where Mycroft’s STT roadmap currently stands. The use of Google, whilst understandable, is only temporary, I guess? Is Mozilla getting anywhere close to something useful? Will the new Mycroft enclosure’s processor be able to run STT locally?
I’ve seen that the upcoming Alexa will process transcription locally, and the Pixel has sufficient hardware to run Google Assistant on-device, so Mycroft relying on the Google cloud is becoming increasingly hard to justify privacy-wise.
Mozilla discontinued development of DeepSpeech. The former Mozilla dev team founded coqui.ai and continues development of STT (and TTS) there. Their latest pretrained English STT model has a WER of 4.5% (on the LibriSpeech clean test set), which I would consider “usable”. If you have a CUDA GPU, you may look at Nvidia’s models, e.g. Conformer-Transducer, which has an even better WER of 1.7%.
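For anyone comparing those numbers on their own recordings: WER is just the word-level edit distance divided by the reference length. A minimal self-contained implementation (my own sketch, not taken from any of the toolkits mentioned) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("on" -> "of") in a 5-word reference: WER = 0.2
print(wer("turn on the kitchen light", "turn of the kitchen light"))
```

So the quoted 4.5% means roughly one wrong word in every 22 on that test set.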
In my tests on a Xavier AGX, the Conformer-Transducer model for German has a real-time factor of 0.127x (one second of audio input was processed in 0.127 seconds). I don’t remember how much of the 32 GB RAM was used.
As the Xavier AGX GPU is much faster than the Jetson Nano (32 TOPS vs. 0.5 TOPS), you might end up with an RTF of >>1x on the Nano, which could be inconvenient for daily usage…
It took a bit of fiddling to get the environment set up on Raspbian Buster, but ultimately pipenv helped me out.
With that in place you get quite a speed-up in the “real-timeness”, even though it all runs on a Raspberry Pi 4 with 4 GB of RAM. Probably easier than getting a Jetson Nano setup working.
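For anyone trying to reproduce this, here is a rough sketch of what that pipenv setup on Raspbian Buster might look like. The exact package version and the availability of ARM wheels are assumptions on my part, so check the project’s release notes first:

```shell
# Sketch only: versions and wheel availability are assumptions, not verified.
sudo apt-get update && sudo apt-get install -y python3-pip
pip3 install --user pipenv

mkdir ~/stt && cd ~/stt
pipenv --python 3.7                # Buster ships Python 3.7
pipenv install deepspeech==0.9.3   # ARM builds use the lighter tflite runtime
# Download the matching .tflite model and scorer into this folder, then:
pipenv run python -c "import deepspeech"
```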
DeepSpeech is dead; long live Coqui STT, which has a streaming API as well. I don’t know if the streaming server mentioned above still works with it, though…
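As far as I can tell, Coqui kept the DeepSpeech-style streaming interface (createStream / feedAudioContent / intermediateDecode / finishStream). The sketch below uses a fake stand-in object with those method names so it runs without a model file; with the real library you would get the stream from `stt.Model("model.tflite").createStream()` instead:

```python
# Stand-in mimicking the Coqui/DeepSpeech streaming interface, so this
# sketch runs without a model file. The real stream object comes from
# stt.Model("model.tflite").createStream().
class FakeStream:
    def __init__(self):
        self._chunks = 0

    def feedAudioContent(self, audio):  # real API expects 16 kHz int16 samples
        self._chunks += 1

    def intermediateDecode(self):       # partial hypothesis so far
        return f"partial after {self._chunks} chunks"

    def finishStream(self):             # flush and return the final text
        return "final transcript"


def transcribe_chunks(stream, chunks):
    """Feed audio chunk by chunk, as a mic or socket server would."""
    for chunk in chunks:
        stream.feedAudioContent(chunk)
    return stream.finishStream()


chunks = [b"\x00" * 640] * 5            # five fake 20 ms chunks of silence
print(transcribe_chunks(FakeStream(), chunks))
```

A streaming server would call `intermediateDecode()` between chunks to push live partial results to the client.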
Yeah, the DeepSpeech folks jumped ship to create Coqui when the funding dried up, so it’s really DeepSpeech in disguise with some newer bits and bobs.
I haven’t tried ESPnet, so I’m clueless about how to load it and what is needed to run it, but currently I think it’s the one to beat, as Kaldi has lost some ground.
TensorFlowASR is quite interesting, as they make no big boasts and call it “Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2”, but the tflite version is really lite and speedy, and I need to give it a revisit.
DeepSpeech/Coqui, last time I looked, sort of ended up in no-man’s-land: neither state-of-the-art nor that lite, just about managing real time on a Pi 4. I haven’t checked whether they have managed to optimise it since.
don’t know if the streaming server mentioned above is still working with it
Me neither. There was an issue on another DeepSpeech server on GitHub, posted by a Coqui contributor, urging the implementer to move to Coqui, so it sounds like at least a little fiddling with the code is needed. Maybe just a change in dependencies and imports, but I can’t say for sure.
Plus, AFAIK, the first Coqui model was basically just a rebranded version of the last DeepSpeech model. They have a more recent one which is supposed to perform better, but there’s no tflite version of it (yet?). So for a Raspi-only solution with streaming definitely working, DeepSpeech 0.9.3 is still a viable option.
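Since Coqui kept the Model API DeepSpeech-compatible, porting often boils down to the import line. A small shim (my own sketch, not an official helper; the PyPI package name “stt” is the Coqui one as I understand it) lets code run against whichever is installed:

```python
def load_stt_model_class():
    """Return (Model class, backend name), preferring Coqui's "stt" package.

    Both packages expose the same DeepSpeech-style Model API, so callers
    can use whichever is installed without further changes.
    """
    try:
        from stt import Model  # Coqui STT (PyPI package "stt")
        return Model, "coqui"
    except ImportError:
        pass
    try:
        from deepspeech import Model  # DeepSpeech 0.9.x
        return Model, "deepspeech"
    except ImportError:
        return None, None


cls, backend = load_stt_model_class()
print(backend)  # "coqui", "deepspeech", or None if neither is installed
```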