The latest OpenVoiceOS images ship by default with libwhispercpp installed, together with the STT plugin @JarbasAl linked above.
It will be interesting to see how development goes, as it seems to be moving at a rapid pace.
Here are some benchmarks and tests, similar to what @StuartIanNaylor posted, with WhisperCPP cross-compiled within the whole buildroot system. I might redo them later with libwhispercpp compiled with the OpenBLAS option.
Benchmark - tiny model (as that is the default model of the STT plugin)
mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m ggml-tiny.bin -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 73.58 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
whisper_print_timings: load time = 1693.92 ms
whisper_print_timings: mel time = 0.00 ms
whisper_print_timings: sample time = 0.00 ms
whisper_print_timings: encode time = 20830.68 ms / 5207.67 ms per layer
whisper_print_timings: decode time = 0.00 ms / 0.00 ms per layer
whisper_print_timings: total time = 22524.77 ms
If you wish, you can submit these results here:
https://github.com/ggerganov/whisper.cpp/issues/89
Please include the following information:
- CPU model
- Operating system
- Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m ggml-tiny.bin -f jfk.wav -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 1
whisper_model_load: mem_required = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 73.58 MB
whisper_model_load: memory size = 11.41 MB
whisper_model_load: model size = 73.54 MB
system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing 'jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans ask not what your country can do for you ask what you can do for your country.
whisper_print_timings: load time = 930.82 ms
whisper_print_timings: mel time = 325.40 ms
whisper_print_timings: sample time = 34.11 ms
whisper_print_timings: encode time = 21394.58 ms / 5348.65 ms per layer
whisper_print_timings: decode time = 1241.69 ms / 310.42 ms per layer
whisper_print_timings: total time = 23929.93 ms
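To put the transcription run in perspective, here is a small sketch that parses the whisper_print_timings lines from the ./main output and works out how long the run took relative to the 11.0 seconds of audio. The "real-time factor" name, the parse_timings helper and the regular expression are my own additions for illustration, not something whisper.cpp provides; the numbers are copied from the log above.

```python
import re

# Timing lines copied from the ./main run above.
LOG = """\
whisper_print_timings: load time = 930.82 ms
whisper_print_timings: mel time = 325.40 ms
whisper_print_timings: sample time = 34.11 ms
whisper_print_timings: encode time = 21394.58 ms / 5348.65 ms per layer
whisper_print_timings: decode time = 1241.69 ms / 310.42 ms per layer
whisper_print_timings: total time = 23929.93 ms
"""

# Audio length from the "main: processing 'jfk.wav'" line.
AUDIO_SECONDS = 11.0


def parse_timings(log):
    """Hypothetical helper: return {stage: milliseconds} for each timing line."""
    pattern = re.compile(r"whisper_print_timings:\s+(\w+) time\s+=\s+([\d.]+) ms")
    return {m.group(1): float(m.group(2)) for m in pattern.finditer(log)}


timings = parse_timings(LOG)

# Real-time factor: processing time divided by audio duration (above 1 is slower than real time).
rtf = timings["total"] / 1000.0 / AUDIO_SECONDS

# Fraction of the total run spent in the encoder.
encode_share = timings["encode"] / timings["total"]

print(f"real-time factor: {rtf:.2f}")              # real-time factor: 2.18
print(f"encode share of total: {encode_share:.1%}")  # encode share of total: 89.4%
```

So on this board the tiny model needs a bit more than twice the length of the audio, and nearly all of that is spent in the encoder, which is the part an OpenBLAS build might help with.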
