I’m creating this topic just out of curiosity, to compare…
After reading a lot about STT I understand that doing it locally with DeepSpeech would be possible, but complicated and more importantly requires powerful hardware, especially a good graphics card. My local server couldn’t do the job.
So honestly, I’m grateful to have a cloud STT service that is not named G or A or similar… And for free, on top of that (I’m not counting the $20 I paid to become a supporter, that’s nothing and I wasn’t even required to do so). However, I have noticed delays in Mycroft’s response that seem to be due to the time it takes for the text to come back from the STT server. This could have one of two causes:
(a) My limited bandwidth (1MB/s up on a good day)
(b) Delays on the STT server itself
… Or both.
I am talking delays ranging from 1-2 seconds (perfectly acceptable) up to 10-12 seconds for the same command (i.e. simple things like “turn off office wall”).
So… What’s your experience?
Hi there @Old-Lodge-Skins, great question. Because we currently use cloud-based STT until we can bring DeepSpeech on to the device locally, there is definitely going to be some delay in the round trip for
Speech to Text ->
Intent Matching
There are a number of reasons that can cause a delay in working through this stack:
- The Device has to capture the audio first, and then upload it to the cloud STT processor. Different hardware does this at different speeds - for example, an RPi 2 will capture more slowly than a Xilinx processor.
- The audio file is then processed by api.mycroft.ai, where it’s anonymized before being sent to the STT-in-the-cloud, so that it’s non-identifying. This adds some overhead.
- The upload speed itself is a factor. I measure my upload in Kbps, so I feel your pain.
- Then the STT servers themselves have to process the audio and send the transcription back, where it is interpreted.
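For anyone curious about quantifying the stages above, here’s a minimal sketch of per-stage timing. This is not Mycroft’s actual instrumentation; the `transcribe` and `match_intent` callables are hypothetical stand-ins for the cloud STT request and the local intent parser.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of one pipeline stage."""
    start = time.monotonic()
    yield
    results[label] = time.monotonic() - start

def measure_round_trip(audio_bytes, transcribe, match_intent):
    """Time each stage of a capture -> STT -> intent pipeline.

    `transcribe` and `match_intent` are placeholders for the real
    network call and intent matcher.
    """
    results = {}
    with timed("stt", results):
        text = transcribe(audio_bytes)
    with timed("intent", results):
        intent = match_intent(text)
    return text, intent, results

# Stub stages, just to exercise the harness:
def fake_stt(audio):
    time.sleep(0.05)  # pretend this is network + server time
    return "turn off office wall"

def fake_intent(text):
    return {"action": "turn off", "entity": "office wall"}

text, intent, timings = measure_round_trip(b"\x00" * 16000, fake_stt, fake_intent)
print(text, timings)
```

Running something like this against a real Device would at least separate upload/server time from local processing time.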
I’d love to be able to get some quantifiable metrics on this round-trip process at some stage, it’s just not something I’ve dug into. I wonder if a tool like Selenium might be able to assist there…
Thanks for the detailed explanation @KathyReid .
I thought DeepSpeech was already in use, actually.
Anyway… If there’s anything I can do to help you test just let me know.
I don’t think the hardware is at fault here, since I have recycled my old laptop; its mobile Core i5 is definitely faster than any Raspberry Pi IMHO.
Anyway, that was just pure curiosity really…
Thanks so much for your offer of help! There’s definitely a lot we need help with, I think one of my challenges is how do I present that in a navigable way that makes it easier for people to say “yeah, I could help with that!”.
On the DeepSpeech side, we can configure a Device to use DeepSpeech, but it’s CPU/GPU heavy so not recommended for lower-power ARM-based Devices like the RPi.
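For reference, pointing a Device at a DeepSpeech instance goes through the STT section of mycroft.conf. The module name and URI below are illustrative only; check the current Mycroft documentation for the exact values:

```json
{
  "stt": {
    "module": "deepspeech_server",
    "deepspeech_server": {
      "uri": "http://localhost:8080/stt"
    }
  }
}
```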
Yes, I knew that about DeepSpeech; that’s why I didn’t even try (well, not yet - my local server is a Core i3 with an old graphics card, so I’m not expecting it to do that job in real time, and the best I have in store is a pair of Nvidia 9800s, which are pre-CUDA and won’t do it either).