Not sure if it makes sense, as the WER % drops off a cliff for the tiny & base models (supposedly, according to another reviewer). For a larger model, I don't know about running on CPU. As for GPU: after some time screaming at my computer trying to install CUDA 11.6 on Ubuntu 22.04, I gave up; use one of the Nvidia Docker containers instead!
But install the right torch first:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Then install Whisper
pip install git+https://github.com/openai/whisper.git
Using https://commons.wikimedia.org/wiki/File:Reagan_Space_Shuttle_Challenger_Speech.ogv (4m 48s):
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model medium.en --threads=8
| real | 0m42.072s |
|---|---|
| user | 0m46.303s |
| sys | 0m3.591s |
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model small.en --threads=8
| real | 0m22.323s |
|---|---|
| user | 0m24.127s |
| sys | 0m2.545s |
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model base.en --threads=8
| real | 0m13.119s |
|---|---|
| user | 0m14.324s |
| sys | 0m2.137s |
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model tiny.en --threads=8
| real | 0m10.855s |
|---|---|
| user | 0m11.907s |
| sys | 0m2.106s |
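Summarising the `real` times above as relative speedups (a rough calculation from the numbers in the tables, so specific to this GPU and this clip):

```python
# "real" wall-clock times from the runs above, in seconds (RTX 3050)
real_s = {
    "medium.en": 42.072,
    "small.en": 22.323,
    "base.en": 13.119,
    "tiny.en": 10.855,
}

baseline = real_s["medium.en"]
# speedup of each model relative to medium.en
speedup = {model: round(baseline / t, 2) for model, t in real_s.items()}
# tiny.en works out to roughly 3.88x faster than medium.en here
```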
So even though that is only an RTX 3050, a desktop GPU a tad slower than a GTX 1070 from memory, it still beats the pants off running on CPU unless your CPU is something pretty awesome.
With GPU, when testing, run twice: loading the model into VRAM accounts for much of the time, and the second run is far faster. I think the difference is purely the model load.
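To separate the model-load time from the transcription time without running twice, something like this sketch with the whisper Python API should work (it assumes openai-whisper is installed; the helper functions are my own names, not part of the library):

```python
import time


def timed(fn, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call of fn."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0


def transcribe_timed(path: str, model_name: str = "medium.en"):
    """Time the model load and the transcription separately."""
    import whisper  # assumes openai-whisper is installed

    model, load_s = timed(whisper.load_model, model_name)
    result, run_s = timed(model.transcribe, path)
    print(f"load: {load_s:.1f}s  transcribe: {run_s:.1f}s")
    return result
```

Called as `transcribe_timed("Reagan_Space_Shuttle_Challenger_Speech.ogv")`, this would show how much of the wall-clock time in the tables above is just model load.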
For Mac users on Arm, GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++ is amazing, but maybe not so much if you have a GPU… I've just been reading that the Metal framework has been available in PyTorch since June.
PS: the ‘review’ was from when I was just googling and knew nothing about Whisper, and it's just coincidence we all seem to have found it. Thinking about it, I have no idea if the review was correct, as it did seem a bit critical. It said Whisper is very good but occasionally gets things totally wrong, and that the published WER was optimistic. It could have been just the multilingual tiny & base models it was critical of; it just stuck in my head, and it could be overly critical?
Just to go on about WER/load some more: https://arxiv.org/pdf/2005.03191.pdf
> ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets.
Which is likely a much better fit for a Pi4: 10M parameters is about a quarter of the Whisper tiny model, and parameter count very likely translates directly to inference speed.
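Rough numbers behind the "quarter" claim (assuming the ~39M parameter count for Whisper tiny given in the Whisper paper):

```python
# Parameter counts: ContextNet-S from the ContextNet paper,
# Whisper tiny ~39M (assumption: figure from the Whisper paper)
contextnet_s_params = 10_000_000
whisper_tiny_params = 39_000_000

ratio = contextnet_s_params / whisper_tiny_params  # ~0.26, i.e. roughly a quarter
```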
I have always liked GitHub - TensorSpeech/TensorFlowASR: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords. Maybe it's just my preference for it over PyTorch (you can do the same things with PyTorch), but with TFLite I have a reasonable knowledge of how easy it is to use a Coral delegate, or Mali, or whatever, or to partition a model so it runs simultaneously across several of CPU/GPU/NPU, which is why I have the RK3588.
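As a sketch of how little code the Coral delegate needs (assuming tflite_runtime is installed and you have an Edge TPU-compiled model; the helper names and model path are mine, the library names per platform are from the Coral docs):

```python
import sys

# Edge TPU delegate shared-library name per platform (per the Coral docs)
EDGETPU_LIBS = {
    "linux": "libedgetpu.so.1",
    "darwin": "libedgetpu.1.dylib",
    "win32": "edgetpu.dll",
}


def edgetpu_lib(platform: str = sys.platform) -> str:
    """Pick the Edge TPU delegate library for the current platform."""
    return EDGETPU_LIBS[platform]


def make_edgetpu_interpreter(model_path: str):
    """Build a TFLite interpreter with the Coral Edge TPU delegate attached."""
    import tflite_runtime.interpreter as tflite  # assumes tflite_runtime installed

    return tflite.Interpreter(
        model_path=model_path,  # must be an Edge TPU-compiled .tflite model
        experimental_delegates=[tflite.load_delegate(edgetpu_lib())],
    )
```

Swapping the delegate (e.g. for a Mali GPU) is just a different `load_delegate` argument, which is the flexibility I mean.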
Same with TTS: GitHub - TensorSpeech/TensorFlowTTS: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported languages include English, French, Korean, Chinese and German, and it is easy to adapt for other languages). Conversion to TFLite and support for embedded devices and accelerators seem much better, or at least they were; now that I'm dodging PyTorch I'm lacking up-to-date knowledge.
