Adding "Whisper" as local STT option?

The latest OpenVoiceOS images ship by default with libwhispercpp installed, together with the STT plugin @JarbasAl linked above.

Interesting to see how the development goes, as it looks like it is moving at a rapid pace.

Here are some benchmarks and tests similar to what @StuartIanNaylor posted, with WhisperCPP cross-compiled within the whole buildroot system. I might redo them later with libwhispercpp compiled with the OpenBLAS option.

Benchmark - tiny model (as that is the default of the STT plugin)

mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m ggml-tiny.bin -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

whisper_print_timings:     load time =  1693.92 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 20830.68 ms / 5207.67 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 22524.77 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m ggml-tiny.bin -f jfk.wav -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   930.82 ms
whisper_print_timings:      mel time =   325.40 ms
whisper_print_timings:   sample time =    34.11 ms
whisper_print_timings:   encode time = 21394.58 ms / 5348.65 ms per layer
whisper_print_timings:   decode time =  1241.69 ms / 310.42 ms per layer
whisper_print_timings:    total time = 23929.93 ms
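A quick way to read these timings is the real-time factor (RTF): total processing time divided by audio length, where anything above 1.0 is slower than real time. A small sketch, using the numbers from the run above:

```python
def real_time_factor(total_ms: float, audio_s: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return (total_ms / 1000.0) / audio_s

# tiny model on the Pi 4 above: 23929.93 ms total for the 11.0 s jfk.wav
rtf = real_time_factor(23929.93, 11.0)
print(f"RTF: {rtf:.2f}x")  # prints "RTF: 2.18x", i.e. about twice as slow as real time
```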

Strangely, I am sat in front of a mic retrying the streaming mode, as I never checked out the streaming branch before, which is probably why my earlier results were bad.
I haven’t got a Pi4 anymore, so this is running on a Rock5b RK3588.
Streaming has had an update and seems really good, but I am not sure that on short command sentences streaming mode is worthwhile, when latency is so short and non-streaming accuracy is really, really good.

I will post a bench here again just as a check on the non-stream performance, as with streaming mode you can only really check load, since you are feeding a stream that is a bit hard to measure.
Still, the old rules apply: you need to get a good signal from your mic in terms of volume, which is often pretty poor without putting on an AGC and setting the volume to max.
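A very simple software AGC can be sketched like this: scale each audio block toward a target RMS level, with a gain clamp so near-silence isn't blown up into noise. The target level and limits here are arbitrary assumptions, not values from any particular plugin:

```python
import numpy as np

def agc(block: np.ndarray, target_rms: float = 0.1, max_gain: float = 30.0) -> np.ndarray:
    """Scale a float32 audio block toward target_rms, clamping the gain."""
    rms = float(np.sqrt(np.mean(block ** 2)))
    if rms < 1e-6:  # near-silence: leave untouched rather than amplify noise
        return block
    gain = min(target_rms / rms, max_gain)
    return np.clip(block * gain, -1.0, 1.0)

# a very quiet sine block, as often comes off a mic with low input gain
quiet = 0.01 * np.sin(np.linspace(0, 100, 16000, dtype=np.float32))
boosted = agc(quiet)
```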

I can show you load; I have no Pi4 to try on though.

No BLAS

./main -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 8 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.540]   And so my fellow Americans ask not what your country can do for you
[00:00:07.540 --> 00:00:10.160]   ask what you can do for your country.
[00:00:10.160 --> 00:00:30.000]   You can do for your country


whisper_print_timings:     load time =   305.88 ms
whisper_print_timings:      mel time =   134.55 ms
whisper_print_timings:   sample time =    11.85 ms
whisper_print_timings:   encode time =   802.06 ms / 200.51 ms per layer
whisper_print_timings:   decode time =   321.21 ms / 80.30 ms per layer
whisper_print_timings:    total time =  1576.53 ms

Benchmark - base model

mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m models/ggml-base.bin -t 4
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

whisper_print_timings:     load time =  2274.95 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 56033.60 ms / 9338.93 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 58308.86 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m models/ggml-base.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.


whisper_print_timings:     load time =  1338.00 ms
whisper_print_timings:      mel time =   287.82 ms
whisper_print_timings:   sample time =    41.32 ms
whisper_print_timings:   encode time = 55883.51 ms / 9313.92 ms per layer
whisper_print_timings:   decode time =  3145.44 ms / 524.24 ms per layer
whisper_print_timings:    total time = 60706.79 ms

The tiny and base are the only two models that can be tested on my RPI4 with 2 GB of memory. The small model already runs out of memory.


Did you switch to the stream branch, as it seems to run much faster?
PS I couldn’t work out how to enable BLAS, but in the issues it seems like it makes little difference.

Also, on a Pi it is probably better to try the tiny.en model first

git fetch --all
git checkout stream
git reset --hard origin/stream

make clean
make stream

./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000

Using that repo also seems almost 3x faster even with non-stream
make tiny.en

I reserve 1/3 of memory for a zram compressed swap system within the OpenVoiceOS system. Probably the same approach as what you linked.

Anyhow, running it just quickly fills up the memory and swap and then kills itself.

Combined benchmarks table

mycroft@OpenVoiceOS-e3830c:~ $ ./bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running benchmark for the tiny and base models
This can take a while!

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | tiny | 4 | 835.68 | 21509.08 |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | base | 4 | 1183.47 | 55256.20 |


mycroft@OpenVoiceOS-e3830c:~ $ ./bench-all.sh 1
Usage: ./bench.sh [n_threads]

Running benchmark for the tiny and base models
This can take a while!

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | tiny | 1 | 861.34 | 29428.21 |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | base | 1 | 1146.02 | 87615.00 |
| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny.en | 4 | 243.54 | 779.49 |
| RK3588 | Ubuntu 20.04 | NEON | base.en | 4 | 316.52 | 1821.06 |
| RK3588 | Ubuntu 20.04 | NEON | small.en | 4 | 618.93 | 7117.69 |
| RK3588 | Ubuntu 20.04 | NEON | medium.en | 4 | 1514.88 | 24139.92 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 4 | 233.86 | 791.01 |
| RK3588 | Ubuntu 20.04 | NEON | base | 4 | 297.93 | 1813.69 |
| RK3588 | Ubuntu 20.04 | NEON | small | 4 | 592.18 | 7102.28 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 4 | 1587.36 | 24147.87 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 8 | 226.48 | 740.34 |
| RK3588 | Ubuntu 20.04 | NEON | base | 8 | 300.48 | 1723.42 |
| RK3588 | Ubuntu 20.04 | NEON | small | 8 | 620.58 | 6392.47 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 8 | 1533.75 | 21899.08 |

I still haven’t worked out the little (cores 0-3) / big (cores 4-7) arrangement on this thing, but pinning to the big cores with taskset -c 4-7:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny.en | 4 | 234.14 | 681.53 |
| RK3588 | Ubuntu 20.04 | NEON | base.en | 4 | 297.08 | 1679.75 |
| RK3588 | Ubuntu 20.04 | NEON | small.en | 4 | 599.98 | 6867.66 |
| RK3588 | Ubuntu 20.04 | NEON | medium.en | 4 | 1492.73 | 23600.45 |

I tried to compile with OpenBLAS but it seemed to kill the make.


From the master repo, as I didn’t think about which repo I was in after trying stream:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 8 | 226.48 | 2681.05 |
| RK3588 | Ubuntu 20.04 | NEON | base | 8 | 283.56 | 6132.44 |
| RK3588 | Ubuntu 20.04 | NEON | small | 8 | 583.39 | 24397.78 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 8 | 1490.98 | 85099.45 |

I think all the stream repo does is cut the audio ctx window from 30 sec to 10 sec. So on a short command sentence it is much faster, while it is also slightly faster on longer transcription.
Also, the overall timing includes the load time, but that only happens on first load, so if the model stays resident in memory you can discount it.
I would give tiny.en a go with the stream mode.
Should probably have loaded from eMMC or NVMe, but SD is like-for-like.
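As a sanity check on the window size: Whisper's log-mel spectrogram runs at 100 frames per second and the encoder's conv front-end downsamples by 2, so a 30-second window gives exactly the n_audio_ctx = 1500 seen in the loader output above. A small sketch of that arithmetic:

```python
MEL_FRAMES_PER_SECOND = 100  # Whisper's log-mel hop is 10 ms
ENCODER_STRIDE = 2           # the conv front-end halves the frame count

def audio_ctx(window_seconds: float) -> int:
    """Encoder context length for a given audio window."""
    return int(window_seconds * MEL_FRAMES_PER_SECOND / ENCODER_STRIDE)

print(audio_ctx(30))  # 1500, matching n_audio_ctx in the logs
print(audio_ctx(10))  # 500, for a hypothetical 10 s window
```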

As it is all run on the CPU, obviously having more cycles available influences the results. The table below presents the different information a bit better.

With the OVOS services running represents performance when run with the STT plugin (although the Python bindings and library route might still alter the data a bit; not sure whether for the better or the worse).

Without the OVOS services running is more for comparison with the other data out there.

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
| --- | -- | ------ | ----- | ------- | --------- | ----------- | ------- |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |

As you can see, especially when using all 4 threads this makes a difference. Not so much with 1 thread, as the other 3 are still available for the whole OVOS software stack.


It’s a shame really, as Whisper steps up a notch on the small model; ‘turn on the light’ for me is perfect in the tiny model with the faster stream repo.
Other phrases, though, for me:
tiny.en → [00:00:00.000 → 00:00:05.760] setting a loan for Monday at 9 o’clock [832.27 ms]
base.en → [00:00:00.000 → 00:00:05.800] setting alarm for Monday at 9 o’clock. [ 2238.11 ms]
small.en → [00:00:00.000 → 00:00:05.800] Set an alarm for Monday at 9 o’clock [6852.66 ms]

small.en steps up a notch and gets it perfect, but is about 1.5x real time on the same 4.4 sec sample.

The tiny and base models do fit in my 2GB RPi4; the small one just needs 4GB as a minimum.

Going to test the tiny and base models with @JarbasAl’s STT plugin in the upcoming days.

Try the stream repo, as whatever it does, even with a normal file feed it is 3x faster and works just the same on short command sentences.
I think the Pi4 is lacking in performance and falls a bit short, and a bit short is actually a long way.
With Whisper, that is.

Whisper is just a transformer by OpenAI, and likely there will be more; maybe someone will split the model so SoCs with AI accelerators or GPUs can partition the model and share the load with the CPU.
A lot of the NPUs are int8 only, and that only really supports fairly simple models.

Did you compare the streaming branch with the latest commit on the normal one?

Isn’t the speed coming from the 2x speed-up commit that was recently pushed?

Yeah, I am thinking my RK3588 is a tad short, as the minimum model you really need to be running is the small model, since that is when it gets near some of what OpenAI claim. In fact, it is probably quite a bit off, and 3x+ what you will get on a Pi4.
The base and tiny models start to fall off a cliff with WER, and yeah, it will just about run them, but the Pi4 is not particularly good at running transformer models, even with very heavily optimised code and quantised models; it is more of a celebration that it runs than a working ASR.
The Whisper model is a translation model based on a 30 sec window, and it is far from optimised for short smart-assistant commands, even though Georgi Gerganov has made some god-level coding optimisations and model hacks to get where we are.
It has some English-only models hacked in as well, and as examples they produced some vastly smaller models, but the published WER and acclaim are for the large model, which scales quite well down to the small model and then goes into a bit of a nosedive.
If the training methods of Whisper were also open source, then it would likely be a different story and Linux would have its first commercially competitive ASR.
Part of it is the way Whisper works, which is far less about phonetic correctness and more about what has the highest matrix score over a 30-second window; it gets longer sentences far more correct than short ones, as they contain so much more sentence logic.
Get a mic and give it a try with typical command sentences rather than book-like narrative reading.
The shorter the sentence and the smaller the model, the worse it gets.

Did some more testing with OpenBLAS. For Aarch64 this again brings some performance gains. For completeness’ sake, the full table:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
| --- | -- | ------ | ----- | ------- | --------- | ----------- | ------- |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 843.80 | 16145.62 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 824.24 | 13187.96 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 1103.39 | 52228.30 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 1161.32 | 29851.40 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 751.96 | 13082.95 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 742.90 | 9564.89 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 979.65 | 43852.07 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 982.80 | 19910.19 | Without OVOS services running |
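To put the gains in perspective, here is a quick sketch computing the encode-time speedup between a few of the NEON-only and NEON BLAS rows above (numbers copied straight from the table):

```python
# (model, threads, services) -> (encode_ms without BLAS, encode_ms with BLAS)
runs = {
    ("tiny", 1, "with OVOS"): (29428.21, 16145.62),
    ("tiny", 4, "with OVOS"): (21509.08, 13187.96),
    ("base", 4, "no OVOS"):   (24814.62, 19910.19),
}

for key, (plain, blas) in runs.items():
    print(key, f"{plain / blas:.2f}x faster with OpenBLAS")
```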

With the tiny model, close to 1x real time:

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-tiny.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   741.98 ms
whisper_print_timings:      mel time =   200.68 ms
whisper_print_timings:   sample time =    22.64 ms
whisper_print_timings:   encode time =  9588.57 ms / 2397.14 ms per layer
whisper_print_timings:   decode time =   716.64 ms / 179.16 ms per layer
whisper_print_timings:    total time = 11276.80 ms

With the base model, twice as slow, however a slightly better transcription as it picked up a comma :smiley: (could be by accident)

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-base.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.


whisper_print_timings:     load time =   978.30 ms
whisper_print_timings:      mel time =   198.97 ms
whisper_print_timings:   sample time =    28.10 ms
whisper_print_timings:   encode time = 20292.19 ms / 3382.03 ms per layer
whisper_print_timings:   decode time =  1504.07 ms / 250.68 ms per layer
whisper_print_timings:    total time = 23004.49 ms

Just about managed to get the small model into memory, so here it goes;

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-small.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-small.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  4153.16 ms
whisper_print_timings:      mel time =   207.59 ms
whisper_print_timings:   sample time =    32.41 ms
whisper_print_timings:   encode time = 80229.52 ms / 6685.79 ms per layer
whisper_print_timings:   decode time =  3845.80 ms / 320.48 ms per layer
whisper_print_timings:    total time = 88472.29 ms

Yeah, I just don’t think the Pi4 has the oomph to run transformer models.

Intel have been doing optimisations (Fast DistilBERT on CPUs, if you google it), but currently a lot of models are out of range of a Pi4, or at least, as with Whisper, the very small heavily quantised models bear little resemblance to the accuracy the bigger models are famed for.
I have got a 6 TOPS NPU embedded in the RK3588 that I think will run at least 3x faster than the CPU if I can get an int8-quantised version of Whisper.
When you get an embedded NPU sharing address space and memory with the SoC, it gets much faster than the Coral USB add-ons, or at least much faster than many of the Coral benchmarks I have seen.
For £70, the 4x that the Coral USB seems to give makes me wonder if it is worth it. It is really hard to compare, as how TOPS are accounted, quantisation and models all differ, but probably the small.en Whisper model would run well as an int8 conversion on an NPU of approx 4-6 TOPS.
On the 6 TOPS NPU of the RK3588 it produces approx 190 fps with ResNet-18 on 224x224 images, but what does that mean?
The power draw of the NPU is much less though, at 1.5 W peak, so as well as being more performant it is also cooler and more efficient than the CPU, which draws about 5 W running the same model, and that is the only real comparison I can make.

I am thinking Georgi Gerganov has already taken the optimisation to the max, and currently it is: use a GPU, or keep your fingers crossed for an int8 NPU version.

This looks interesting, as it has been converted to tflite, and it also looks like it has been separated into distinct functions.

It’s tflite, and by the looks of it much is already converted to int8, so it would likely run on a Coral accelerator, or you could even partition the model between CPU & NPU.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

PS Has anyone tried the bigger models of Whisper? For some reason it is not mentioned, but you can have crazy SNR levels and the medium or large model, compared to others, gets the transcription correct by magic.

I have been playing with DeepFilterNet, which is great but not much good for Whisper, as it actually makes results worse.
If you have a model that you can train, then you would likely have to preprocess the dataset with DeepFilterNet, and quite likely it would then do just as well in noise. OpenAI have kept the training of Whisper to themselves, but you should try the larger models with noise, as the results are outstanding.

Hi there,
I’m quite new to Mycroft.
Can I hope that we will see Introducing Whisper once on the Mark II - Mycroft?
I’m thinking of buying one, but I’m still unsure, as I also have a strong NAS which maybe could handle “Whisper”.
Thank you

@StuartIanNaylor Reaching out to you here as the question is very much related to all of the above.

Also looking into the INT8-converted Whisper model that runs on TFLite, as I expect that is more of an option for low-powered devices such as the RPi family. Thank you very much for the heads-up about it.

I see you investigated the noise-removal option DeepFilterNet, and just like for me a long time ago when I looked into RNNoise, it was not really helping.

Anyhow, the question: how/what did you do with the audio sample rate? TensorFlow Lite, Whisper, etc. are all trained on and only support 16 kHz audio, while all the noise reduction programs and models are trained on and only support 48 kHz audio. I guess resampling will not do the whole process any good, especially resampling from 48 kHz, with artifacts after the noise removal, down to 16 kHz.

Downsampling is essentially lossless for speech; it is just upsampling where there are problems.
DeepFilterNet only started development a year ago, with the first release on Nov 5, 2021, and the artefacts don’t matter. But if using such a filter (stay away from RNNoise as it is bad and very old; DTLN is OK as well), you need to create a dataset that has been cleaned by the filter, not the clean dataset that most ASR and KWS have used; then, when you train, the training learns the filter fingerprint (artefacts) as part of the training.

The MS DNS-Challenge has a quite good tool to add noise to a dataset.

You would have to take all of LibriSpeech or something, run your filter on it, and then do your training, and this is true even with hardware if you truly wanted it optimal.

So yeah, downsampling is no problem and really neither are the artefacts; it is the models trained on clean voice that are the problem in this scenario.

Whisper, on the bigger models, has a jaw-dropping ability to filter out noise itself, and if we could train it with DeepFilterNet it would likely be even more awesome. Or maybe something would essentially be lost, but I do not think so; it is just a cutting-edge transformer model that would equally learn filtered voice, which is simply absent from its dataset.

People are trying to hack the Whisper model, as it was released as a binary, and I dunno if int8 will happen or not, but NPU/GPU/CPU are all becoming part of an ML processing unit on fairly low-cost devices.
I have been wondering if Google Coral are going to update their devices, but many good solutions are cheaper and faster by being onboard and sharing memory.

PS @j1nx I have started playing with KWS again, as my RTX 3050 ain’t really up for ASR training, and it’s been a while and my MS plays havoc with my memory.
If you ever get the chance, check this out and tell me if sounddevice is doing the same for you. I’m sure it wasn’t like this before, but I’m having to set a software gain, after much head-scratching about why I wasn’t getting recognition.

https://drive.google.com/file/d/1m8-LvW9vpOG4iJVYUaOWRr-QKGuA1cl7/view?usp=share_link

That is just a tflite model, and the simple code is here; but am I doing something stupid? Once more, after a break it’s like starting with a blank slate.

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading
  
def sd_callback(rec, frames, time, status):
    global gain, max_rec, kw_hit, kw_count, sample_rate, rec_duration
    # Notify if errors
    if status:
        print('Error:', status)
    
    rec = np.reshape(rec, (1, int(sample_rate * rec_duration)))
    rec = np.multiply(rec, gain)
    
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
      interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
  
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
      # The function `get_tensor()` returns a copy of the tensor data.
      # Use `tensor()` in order to get a pointer to the tensor.
      inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
       
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
      max_rec = lvl
          
    if output_data[0][0] > 0.95:
      kw_hit = True
      kw_count += 1
      print("Marvin:", output_data[0][0], lvl)

    elif output_data[0][1] > 0.90:
        if kw_hit == True:
          print('Max lvl:', max_rec)
          kw_hit = False
          max_rec = 0.0
          if kw_count > 60:
            print('Hello Marvin', kw_count)
        kw_count = 0

        
# Parameters
rec_duration = 0.020
sample_rate = 16000
num_channels = 1

gain = 10.0
max_rec = 0.0
kw_hit = False
kw_count = 0
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn/tflite_stream_state_external/stream_state_external.tflite")
#interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_2/tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

Can you see how I have hacked it with gain = 10.0? Is it the same on your system, or is it just me?
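One way to check whether the mic level really is that low, rather than guessing at a gain, is to measure the RMS level of a captured block in dBFS. A minimal sketch; the -30 dBFS threshold for "too quiet" is just an arbitrary assumption:

```python
import math

import numpy as np

def dbfs(block: np.ndarray) -> float:
    """RMS level of a float32 audio block in dB relative to full scale."""
    rms = float(np.sqrt(np.mean(block ** 2)))
    return 20.0 * math.log10(max(rms, 1e-10))

block = 0.01 * np.ones(320, dtype=np.float32)  # stand-in for a very quiet capture
level = dbfs(block)
if level < -30.0:
    print(f"mic level {level:.1f} dBFS - consider AGC or a software gain")
```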

Going through your post a bit later; just a quickie:

Exactly. Most KWS systems are trained on 16 kHz, and it is most likely useless to de-noise the continuous listening mode for KWS.

So run a continuous listening thread for KWS. Take precise-lite as an example, as we are on the Mycroft forums. For that we have to run our hardware at 16 kHz. As soon as we have KWS confirmation, we use VAD detection to know when we have stopped speaking. For that we can use WebRTC or Silero, which both do a proper job. The latter performs best, however it needs onnxruntime, which I have not yet implemented.

Anyhow, for all VAD-detected audio we need to resample our mic up to 48 kHz to feed it into the denoise system and then straight back down to 16 kHz for the STT system. I have not yet run any tests, as I am still creating the whole image to start playing with it, but my gut feeling just says it is way too many audio conversions.
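The 16 kHz → 48 kHz → denoise → 16 kHz chain described above can be sketched with scipy's polyphase resampler. The denoise step here is a hypothetical placeholder; in practice it would be the DeepFilterNet call:

```python
import numpy as np
from scipy.signal import resample_poly

def denoise_48k(audio: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a 48 kHz denoiser such as DeepFilterNet."""
    return audio

def denoise_16k_chunk(chunk_16k: np.ndarray) -> np.ndarray:
    """Resample a 16 kHz mic chunk to 48 kHz, denoise, and come back to 16 kHz."""
    up_48k = resample_poly(chunk_16k, 3, 1)    # 16 kHz -> 48 kHz (factor 3 up)
    clean_48k = denoise_48k(up_48k)
    return resample_poly(clean_48k, 1, 3)      # 48 kHz -> 16 kHz for STT

one_second = np.zeros(16000, dtype=np.float32)
out = denoise_16k_chunk(one_second)
print(out.shape)  # (16000,) - same length, back at 16 kHz
```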

Just trying to find different software components to play with that are a bit more aligned to one hardware setting for the audio.

…getting back to you about the rest…