Adding "Whisper" as local STT option?

I never noticed, Jarbas. I presumed you were using the original OpenAI implementation with GPU support, as I'd never read the thread. Even though what Georgi has done is amazing, you still need a monster CPU to run it.

Whisper is absolutely amazing, but WER accuracy drops off substantially on the tiny and base (multilingual) models, and it doesn't make much sense to use those: for the load/WER trade-off there are probably older and better solutions. The small (multilingual) model starts to approach the great WER figures Whisper is known for, but the size and load grow roughly exponentially.

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
| ---- | ---------- | ------------------ | ------------------ | ------------- | -------------- |
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
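To make the trade-off concrete, here is a tiny, hypothetical helper (using the approximate figures from the table above) for picking the biggest model that fits a given VRAM budget:

```python
# Rough model-selection helper built from the (approximate) figures above:
# name, parameters (millions), required VRAM (GB), relative speed vs. large.
MODELS = [
    ("tiny",   39,   1, 32),
    ("base",   74,   1, 16),
    ("small",  244,  2, 6),
    ("medium", 769,  5, 2),
    ("large",  1550, 10, 1),
]

def largest_model_for(vram_gb):
    """Return the biggest model whose stated VRAM requirement fits."""
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    return max(fitting, key=lambda m: m[1])[0] if fitting else None

print(largest_model_for(4))   # a 4 GB card tops out at "small"
```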

I haven't tried the original on GPU, and the streaming CPU mode seems to send things crazy again, with the load ballooning even more.
Transcribing a full WAV on a MacBook M1 Pro returned almost 6x real time with medium.en, which I presume is about as accurate as the large multilingual model. I presume it's a similar story with streaming input; I don't know why streaming seems so much more load, but I'd hazard a guess that streaming on a MacBook M1 Pro might only manage the base.en model.

It really is amazing, but wow, it's a total monster of a model. I'm presuming the multilingual model is similar in accuracy to the previous smaller language-specific models, but I've never tried the multilingual one.

yeah, this will really need a beefy machine for multilang, but I'm hoping the tiny.en model will be usable on a Pi 4…

my intended usage is to use this plugin with ovos TTS Server - Documentation, so when I am done with the Dockerfile I can have a large model running in a beefy setup and the Pis just sending HTTP requests
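As a toy sketch of that satellite pattern, the Pi side only needs to POST audio bytes over HTTP. Note the `/stt` path and the JSON reply here are made up for the demo (check the real ovos server for its actual API); the "server" below just fakes a transcript in-process so the round trip can be shown end to end:

```python
# Toy illustration of the satellite pattern: a beefy box runs the model
# behind HTTP, the Pi just POSTs audio bytes. The /stt path and JSON reply
# are invented for this demo; the real ovos server API may differ.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class FakeSTT(BaseHTTPRequestHandler):
    def do_POST(self):
        audio = self.rfile.read(int(self.headers["Content-Length"]))
        # A real server would run Whisper here; we fake a transcript.
        body = json.dumps({"text": f"transcribed {len(audio)} bytes"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSTT)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "Pi" client: POST the (pretend) WAV payload, read back the text.
req = Request(f"http://127.0.0.1:{server.server_port}/stt",
              data=b"\x00" * 32000,
              headers={"Content-Type": "audio/wav"})
reply = json.loads(urlopen(req).read())
print(reply["text"])
server.shutdown()
```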

Not sure if it makes sense, as the WER % drops off a cliff for the tiny and base models (supposedly, according to another reviewer). I don't know about running those on CPU; as for running on GPU, after some time screaming at my computer while trying to install CUDA 11.6 on Ubuntu 22.04, I gave up and would suggest using one of the Nvidia Docker containers instead!
But install the right torch first:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Then install Whisper:
pip install git+https://github.com/openai/whisper.git
Using https://commons.wikimedia.org/wiki/File:Reagan_Space_Shuttle_Challenger_Speech.ogv (4m 48s):
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model medium.en --threads=8

real 0m42.072s
user 0m46.303s
sys 0m3.591s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model small.en --threads=8

real 0m22.323s
user 0m24.127s
sys 0m2.545s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model base.en --threads=8

real 0m13.119s
user 0m14.324s
sys 0m2.137s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model tiny.en --threads=8

real 0m10.855s
user 0m11.907s
sys 0m2.106s

So even though it's only an RTX 3050 desktop GPU, from memory a tad slower than a GTX 1070, it still beats the pants off running on CPU unless you have something pretty awesome.
When testing on GPU, run twice: loading the model into VRAM accounts for much of the first run, and the second run is far faster, so I think the difference is purely the model load.
For Mac users on Arm, GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++ is amazing, but maybe less so if you have a GPU… I've just been reading that the Metal framework has been available in PyTorch since June.

PS: the 'review' was from when I was just googling and knew nothing about Whisper; it's just coincidence that we all seem to have found it. Thinking about it, I have no idea if the review was correct, as it did seem a bit critical. It said Whisper is very good but occasionally gets things totally wrong, and that the published WER was optimistic; it could have been only the multilingual tiny and base models it was critical of. It just stuck in my head, and it could be overly critical.

Just to go on a bit more about WER vs. load: https://arxiv.org/pdf/2005.03191.pdf

ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets

That is likely a much better fit for a Pi 4: 10M parameters is about a quarter of the Whisper tiny model, and parameter count very likely translates directly to inference speed.
I have always liked GitHub - TensorSpeech/TensorFlowASR: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords, maybe because of my preference for it over PyTorch. You can do the same things with PyTorch, but with TFLite I have a reasonable knowledge of how easy it is to use a TFLite Coral delegate, or Mali, or whatever, or to partition a model so it runs simultaneously across several of CPU/GPU/NPU, which is why I have the RK3588.
Same with TTS: with GitHub - TensorSpeech/TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages), the conversion to TFLite and the support for embedded devices and accelerators seem much better, or at least they were; by now, because I am dodging PyTorch, I am lacking knowledge there.

Arm has an interesting tutorial with ArmNN that gives pretty awful results, but I'll probably play with it to see if I can get the Mali delegate working.

https://developer.arm.com/documentation/102603/2108/Device-specific-installation/Install-on-Raspberry-Pi

 rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ time python3 run_audio_file.py --audio_file_path tests/testdata/quick_brown_fox_16000khz.wav --model_file_path tflite_int8/wav2letter_int8.tflite  --preferred_backends CpuAcc CpuRef
Your ArmNN library instance does not support Onnx models parser functionality.  Skipped IOnnxParser import.
Preferred backends: ['CpuAcc', 'CpuRef']
IDeviceSpec { supportedBackends: [CpuAcc, CpuRef]}
Optimization warnings: ()
Processing Audio Frames...
the quick brown fox juhmpe over the llazy dag

real    0m2.693s
user    0m8.031s
sys     0m0.282s

Just to see if I can switch to the GPU with the Mali G610 MP4, which is supposed to have pretty good ML performance:

# Copyright © 2021 Arm Ltd and Contributors. All rights reserved.
# SPDX-License-Identifier: MIT

"""Automatic speech recognition with PyArmNN demo for processing audio clips to text."""

import sys
import os
import numpy as np

script_dir = os.path.dirname(__file__)
sys.path.insert(1, os.path.join(script_dir, '..', 'common'))

from argparse import ArgumentParser
from network_executor import ArmnnNetworkExecutor
from utils import prepare_input_data
from audio_capture import AudioCaptureParams, capture_audio
from audio_utils import decode_text, display_text
from wav2letter_mfcc import Wav2LetterMFCC, W2LAudioPreprocessor
from mfcc import MFCCParams

# Model Specific Labels
labels = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k', 11: 'l', 12: 'm',
          13: 'n',
          14: 'o', 15: 'p', 16: 'q', 17: 'r', 18: 's', 19: 't', 20: 'u', 21: 'v', 22: 'w', 23: 'x', 24: 'y',
          25: 'z',
          26: "'", 27: ' ', 28: '$'}


def parse_args():
    parser = ArgumentParser(description="ASR with PyArmNN")
    parser.add_argument(
        "--audio_file_path",
        required=True,
        type=str,
        help="Path to the audio file to perform ASR",
    )
    parser.add_argument(
        "--model_file_path",
        required=True,
        type=str,
        help="Path to ASR model to use",
    )
    parser.add_argument(
        "--preferred_backends",
        type=str,
        nargs="+",
        default=["CpuAcc", "CpuRef"],
        help="""List of backends in order of preference for optimizing
        subgraphs, falling back to the next backend in the list on unsupported
        layers. Defaults to [CpuAcc, CpuRef]""",
    )
    return parser.parse_args()


def main(args):
    # Read command line args
    audio_file = args.audio_file_path

    # Create the ArmNN inference runner
    network = ArmnnNetworkExecutor(args.model_file_path, args.preferred_backends)

    # Specify model specific audio data requirements
    audio_capture_params = AudioCaptureParams(dtype=np.float32, overlap=31712, min_samples=47712, sampling_freq=16000,
                                              mono=True)

    buffer = capture_audio(audio_file, audio_capture_params)

    # Extract features and create the preprocessor

    mfcc_params = MFCCParams(sampling_freq=16000, num_fbank_bins=128, mel_lo_freq=0, mel_hi_freq=8000,
                             num_mfcc_feats=13, frame_len=512, use_htk_method=False, n_fft=512)

    wmfcc = Wav2LetterMFCC(mfcc_params)
    preprocessor = W2LAudioPreprocessor(wmfcc, model_input_size=296, stride=160)
    current_r_context = ""
    is_first_window = True

    print("Processing Audio Frames...")
    for audio_data in buffer:
        # Prepare the input Tensors
        input_data = prepare_input_data(audio_data, network.get_data_type(), network.get_input_quantization_scale(0),
                                        network.get_input_quantization_offset(0), preprocessor)

        # Run inference
        output_result = network.run([input_data])

        # Slice and Decode the text, and store the right context
        current_r_context, text = decode_text(is_first_window, labels, output_result)

        is_first_window = False

        display_text(text)

    print(current_r_context, flush=True)


if __name__ == "__main__":
    args = parse_args()
    main(args)

Wav2Letter seems to be exactly what the name says and lacks a context dictionary, but it was ArmNN that was of interest, as unlike a Pi, many Arm boards now have quite capable GPUs and NPUs.
The 8 GB, $149.00 Rock 5B might sound expensive compared to a Pi 4, but on the CPU it ran Whisper 5x faster, it has the most powerful Mali-based GPU I have seen, and it also has a 6 TOPS NPU (supposedly; TOPS is not a good metric).

These boards are never going to compete with the latest and greatest GPUs, but they can partition models and use system RAM, whereas with a dedicated GPU you may want to allocate the VRAM to a single model that stays loaded.
Either way, server-based systems shared across clients (satellites) are a far superior infrastructure: the diversification of commands is inherently client-server, with the big loads of ASR & TTS idle the majority of the time and usage patterns where queued clashes are rare.

The Rock 5B is still only shipping to early adopters, but apparently OKDO will be stocking them; the carrot was $50 off, as the distro images are still extremely raw.

The latest OpenVoiceOS images now ship by default with libwhispercpp installed, together with the STT plugin @JarbasAl linked above.

It will be interesting to see how development goes, as it looks to be moving at a rapid pace.

Here are some benchmarks and tests similar to what @StuartIanNaylor posted, with whisper.cpp cross-compiled within the whole Buildroot system. I might redo them later with libwhispercpp compiled with the OpenBLAS option.

Benchmark - tiny model (as that is the default of the STT plugin)

mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m ggml-tiny.bin -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

whisper_print_timings:     load time =  1693.92 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 20830.68 ms / 5207.67 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 22524.77 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m ggml-tiny.bin -f jfk.wav -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   930.82 ms
whisper_print_timings:      mel time =   325.40 ms
whisper_print_timings:   sample time =    34.11 ms
whisper_print_timings:   encode time = 21394.58 ms / 5348.65 ms per layer
whisper_print_timings:   decode time =  1241.69 ms / 310.42 ms per layer
whisper_print_timings:    total time = 23929.93 ms

Strangely, I am sat in front of a mic retrying the streaming mode; I had never checked out the stream branch before, which is probably why my earlier results were bad.
I haven't got a Pi 4 anymore, so I'm running on a Rock 5B (RK3588).
Streaming has had an update and seems really good, but for short command sentences I'm not sure streaming mode is worthwhile when latency is already so short and non-streaming accuracy is really, really good.

I will post a bench here again just to check the non-stream performance; in streaming mode you can only really check load, as you are feeding a stream, which is a bit hard to measure.
Still, the old rules apply: you need a good signal from your mic in terms of volume, which is often pretty poor without applying AGC and maxing the volume.

I can show you load, though I have no Pi 4 to try it on.

No BLAS

./main -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 8 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.540]   And so my fellow Americans ask not what your country can do for you
[00:00:07.540 --> 00:00:10.160]   ask what you can do for your country.
[00:00:10.160 --> 00:00:30.000]   You can do for your country


whisper_print_timings:     load time =   305.88 ms
whisper_print_timings:      mel time =   134.55 ms
whisper_print_timings:   sample time =    11.85 ms
whisper_print_timings:   encode time =   802.06 ms / 200.51 ms per layer
whisper_print_timings:   decode time =   321.21 ms / 80.30 ms per layer
whisper_print_timings:    total time =  1576.53 ms

Benchmark - base model

mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m models/ggml-base.bin -t 4
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

whisper_print_timings:     load time =  2274.95 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 56033.60 ms / 9338.93 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 58308.86 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m models/ggml-base.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.


whisper_print_timings:     load time =  1338.00 ms
whisper_print_timings:      mel time =   287.82 ms
whisper_print_timings:   sample time =    41.32 ms
whisper_print_timings:   encode time = 55883.51 ms / 9313.92 ms per layer
whisper_print_timings:   decode time =  3145.44 ms / 524.24 ms per layer
whisper_print_timings:    total time = 60706.79 ms

The tiny and base are the only two models that can be tested on my RPI4 with 2 GB of memory. The small model already runs out of memory.

?

Did you switch to the stream branch? It seems to run much faster.
PS: I couldn't work out how to enable BLAS, but from the issues it seems to make little difference.

On a Pi it's probably better to try the tiny.en model first:

git fetch --all
git checkout stream
git reset --hard origin/stream

make clean
make stream

./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000

But using that repo also seems almost 3x faster, even in non-stream mode:
make tiny.en

I reserve a third of memory for a zram-compressed swap system within the OpenVoiceOS system. Probably the same approach as what you linked.

Anyhow, running it just quickly fills up the memory and swap, and then it kills itself.

Combined benchmarks table

mycroft@OpenVoiceOS-e3830c:~ $ ./bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running benchmark for the tiny and base models
This can take a while!

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | tiny | 4 | 835.68 | 21509.08 |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | base | 4 | 1183.47 | 55256.20 |


mycroft@OpenVoiceOS-e3830c:~ $ ./bench-all.sh 1
Usage: ./bench.sh [n_threads]

Running benchmark for the tiny and base models
This can take a while!

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | tiny | 1 | 861.34 | 29428.21 |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | base | 1 | 1146.02 | 87615.00 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny.en | 4 | 243.54 | 779.49 |
| RK3588 | Ubuntu 20.04 | NEON | base.en | 4 | 316.52 | 1821.06 |
| RK3588 | Ubuntu 20.04 | NEON | small.en | 4 | 618.93 | 7117.69 |
| RK3588 | Ubuntu 20.04 | NEON | medium.en | 4 | 1514.88 | 24139.92 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 4 | 233.86 | 791.01 |
| RK3588 | Ubuntu 20.04 | NEON | base | 4 | 297.93 | 1813.69 |
| RK3588 | Ubuntu 20.04 | NEON | small | 4 | 592.18 | 7102.28 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 4 | 1587.36 | 24147.87 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 8 | 226.48 | 740.34 |
| RK3588 | Ubuntu 20.04 | NEON | base | 8 | 300.48 | 1723.42 |
| RK3588 | Ubuntu 20.04 | NEON | small | 8 | 620.58 | 6392.47 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 8 | 1533.75 | 21899.08 |

I still haven't worked out the little (0-3) / big (4-7) arrangement on this thing, but if I pin to the big cores with taskset -c 4-7:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny.en | 4 | 234.14 | 681.53 |
| RK3588 | Ubuntu 20.04 | NEON | base.en | 4 | 297.08 | 1679.75 |
| RK3588 | Ubuntu 20.04 | NEON | small.en | 4 | 599.98 | 6867.66 |
| RK3588 | Ubuntu 20.04 | NEON | medium.en | 4 | 1492.73 | 23600.45 |
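For working out the little/big split, one approach is to compare each core's advertised max frequency from the standard Linux cpufreq sysfs entries. A sketch; the fallback figures are made up to mimic an RK3588-style 4x A55 + 4x A76 layout:

```python
# Identify "big" cores by their advertised max frequency: on an RK3588
# the A76 cores report a higher cpuinfo_max_freq than the A55s.
import glob

def big_cores(freqs):
    """Given per-core max frequencies, return the indices of the cores
    running at the highest frequency (the big cluster)."""
    top = max(freqs)
    return [i for i, f in enumerate(freqs) if f == top]

def read_sysfs_freqs():
    paths = sorted(glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq"))
    return [int(open(p).read()) for p in paths]

# Fall back to RK3588-like figures (kHz) when sysfs isn't available.
freqs = read_sysfs_freqs() or [1800000] * 4 + [2400000] * 4
print("big cluster:", big_cores(freqs))
```

With the RK3588-like figures, the big cluster comes out as cores 4-7, matching `taskset -c 4-7` above.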

I tried to compile with OpenBLAS, but it seemed to kill the make.


These are from the master repo, as I hadn't thought to switch repos back after trying stream:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu 20.04 | NEON | tiny | 8 | 226.48 | 2681.05 |
| RK3588 | Ubuntu 20.04 | NEON | base | 8 | 283.56 | 6132.44 |
| RK3588 | Ubuntu 20.04 | NEON | small | 8 | 583.39 | 24397.78 |
| RK3588 | Ubuntu 20.04 | NEON | medium | 8 | 1490.98 | 85099.45 |

I think all the stream repo does is cut the processing window from 30 sec to 10 sec, so on a short command sentence it's much faster, while it is also slightly faster on longer transcription.
Also, the overall timing includes the load time, but that only happens on first load, so if the model stays resident in memory you can discount it.
I would give tiny.en a go in stream mode.
I should probably have loaded from eMMC or NVMe, but SD is like-for-like.

As it is all run on the CPU, obviously having more cycles available influences the results. The table below presents the information a bit better.

"With OVOS services running" represents performance when run with the STT plugin (although the Python bindings and library route might still alter the numbers a bit; I'm not sure whether for better or worse).

"Without OVOS services running" is more for comparison with the other data out there.

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
| --- | -- | ------ | ----- | ------- | --------- | ----------- | ------- |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |

As you can see, using all 4 threads especially makes a difference. Not so much with 1 thread, as the other 3 remain available for the whole OVOS software stack.


It's a shame really, as Whisper steps up a notch on the small model; 'turn on the light' is perfect for me on the tiny model with the faster stream repo.
Other commands, though, for me:
tiny.en → [00:00:00.000 → 00:00:05.760] setting a loan for Monday at 9 o’clock [832.27 ms]
base.en → [00:00:00.000 → 00:00:05.800] setting alarm for Monday at 9 o’clock. [ 2238.11 ms]
small.en → [00:00:00.000 → 00:00:05.800] Set an alarm for Monday at 9 o’clock [6852.66 ms]

small.en steps up a notch and gets it perfect, but runs at about 1.5x the real-time duration of the same 4.4 sec sample.

The tiny and base models do fit in my 2 GB RPi 4; the small one needs 4 GB as a minimum.

Going to test the tiny and base models with @JarbasAl's STT plugin in the upcoming days.

Try the stream repo: whatever it does, it's 3x faster even with a normal file feed, and it works just the same on short command sentences.
I think the Pi 4 is lacking in performance and falls a bit short, and a bit short is actually a long way.
With Whisper, that is.

Whisper is just a transformer by OpenAI, and there will likely be more; maybe someone will split the model so SoCs with AI accelerators or GPUs can partition it and share the load with the CPU.
A lot of the NPUs are int8-only, and that only really supports fairly simple models.
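As a rough illustration of why int8-only NPUs are limiting, here is the basic scale-based int8 quantisation they rely on. The weights are made-up values; the round-trip error it prints is exactly the kind of precision loss transformer models tend to be sensitive to:

```python
# Minimal int8 quantisation sketch: map floats to int8 via a scale
# (zero offset here), then map back and measure the round-trip error.
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.6, 0.9]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error {error:.4f}")
```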

Did you compare the streaming branch with the latest commit on the normal one?

Isn't the speed coming from the 2x-speed commit that was recently pushed?

Yeah, I'm thinking my RK3588 is a tad short, as the minimum model you really need to be running is the small one; that is when it gets near to some of what OpenAI claims. In fact it's probably quite a bit off that, and 3x+ what you will get on a Pi 4.
The base and tiny models start to fall off a cliff on WER, and yes, a Pi 4 will just about run them, but it is not particularly good at running transformer models, even with very heavily optimised code and quantised models; it's more a celebration that it runs at all than a working ASR.
The Whisper model is a translation model based on a 30 sec window, far from optimised for short smart-assistant commands, even though Georgi Gerganov has made some god-level coding optimisations and model hacks to get where we are.
It has some English-only models hacked in as well, and as examples they produced some vastly smaller models, but the published WER and acclaim is for the large model; accuracy scales quite well down to the small model and then goes into a bit of a nosedive.
If the training methods of Whisper were also open source, it would likely be a different story, and Linux would have its first commercially comparable ASR.
Part of it is the way Whisper works, which relies far less on phonetic correctness than on the highest matrix score over a 30-second window; it gets long sentences far more correct than short ones, as they contain so much more sentence logic.
Get a mic and give it a try with typical command sentences rather than book-like narrative.
The shorter the sentence and the smaller the model, the worse it gets.
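The fixed-window point can be put in numbers: Whisper encodes a fixed-length window, so a short command is mostly zero-padding, which is presumably why cutting the window down (as the stream repo seems to do) helps short commands so much. A sketch:

```python
# How much of Whisper's fixed encode window a short command wastes.
WINDOW_S = 30  # Whisper's standard window length in seconds

def wasted_fraction(command_s, window_s=WINDOW_S):
    """Fraction of the encoded window that is just padding."""
    return max(0.0, (window_s - command_s) / window_s)

# A 3 s "turn on the light" style command wastes 90% of a 30 s window,
# but only 70% of a 10 s one.
print(f"{wasted_fraction(3):.0%} of a 30 s window is padding")
print(f"{wasted_fraction(3, 10):.0%} of a 10 s window is padding")
```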

Did some more testing with OpenBLAS. For aarch64 this again brings some performance gains. For completeness' sake, the full table:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
| --- | -- | ------ | ----- | ------- | --------- | ----------- | ------- |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 843.80 | 16145.62 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 824.24 | 13187.96 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 1103.39 | 52228.30 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 1161.32 | 29851.40 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 751.96 | 13082.95 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 742.90 | 9564.89 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 979.65 | 43852.07 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 982.80 | 19910.19 | Without OVOS services running |
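For reference, the OpenBLAS build used for the BLAS rows can be reproduced with whisper.cpp's Makefile switch; at the time of writing the flag was WHISPER_OPENBLAS, though this may have changed since, so treat it as a sketch:

```shell
# Install the OpenBLAS development package, then rebuild whisper.cpp
# with the Makefile's BLAS switch; ./main's system_info line should
# then report BLAS = 1.
sudo apt-get install -y libopenblas-dev
make clean
WHISPER_OPENBLAS=1 make
```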

With the tiny model, close to 1x real time:

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-tiny.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   741.98 ms
whisper_print_timings:      mel time =   200.68 ms
whisper_print_timings:   sample time =    22.64 ms
whisper_print_timings:   encode time =  9588.57 ms / 2397.14 ms per layer
whisper_print_timings:   decode time =   716.64 ms / 179.16 ms per layer
whisper_print_timings:    total time = 11276.80 ms

With the base model, twice as slow, though with a slightly better transcription as it picked up a comma :smiley: (could be by accident):

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-base.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.


whisper_print_timings:     load time =   978.30 ms
whisper_print_timings:      mel time =   198.97 ms
whisper_print_timings:   sample time =    28.10 ms
whisper_print_timings:   encode time = 20292.19 ms / 3382.03 ms per layer
whisper_print_timings:   decode time =  1504.07 ms / 250.68 ms per layer
whisper_print_timings:    total time = 23004.49 ms

Just about managed to get the small model into memory, so here it goes:

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-small.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-small.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  4153.16 ms
whisper_print_timings:      mel time =   207.59 ms
whisper_print_timings:   sample time =    32.41 ms
whisper_print_timings:   encode time = 80229.52 ms / 6685.79 ms per layer
whisper_print_timings:   decode time =  3845.80 ms / 320.48 ms per layer
whisper_print_timings:    total time = 88472.29 ms
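A quick way to compare the three runs above is the real-time factor (total processing time divided by the length of the audio). A small sketch using the `total time` values printed above for the 11.0 s jfk.wav clip:

```python
# Real-time factor (RTF) = processing time / audio duration.
# Total times taken from the whisper.cpp runs above (11.0 s jfk.wav).
audio_ms = 11000.0

total_ms = {
    "tiny":  11276.80,
    "base":  23004.49,
    "small": 88472.29,
}

for model, ms in total_ms.items():
    rtf = ms / audio_ms
    print(f"{model:6s} RTF = {rtf:.2f}x")
# tiny comes out at about 1.03x (barely slower than real time),
# base about 2.09x, small about 8.04x on this Pi4.
```

So on this box only tiny is anywhere near real time, which matches the feeling that base is about twice as slow and small is out of the question for live use.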

Yeah, I just don’t think the Pi4 has the oomph to run transformer models.

Intel have been doing optimisations (google “Fast DistilBERT on CPUs”), but currently a lot of models are out of range of a Pi4, or at least, as with Whisper, the very small heavily quantised models bear little resemblance to the accuracy the bigger models are famed for.
I have got a 6 TOPS NPU embedded in the RK3588 that I think will run at least x3 faster than the CPU, if I can get an int8 quantised version of Whisper.
When you get an embedded NPU sharing address space and memory with the SoC, it gets much faster than the Coral USB add-ons, or at least much faster than many of the Coral benchmarks I have seen.
For £70, the x4 speed-up the Coral USB seems to give makes me wonder if it is worth it. It is really hard to compare, as how TOPS are counted, quantisation and models all differ, but the small.en Whisper model converted to int8 would probably run well on an NPU of approx 4-6 TOPS.
On the 6 TOPS NPU of the RK3588 it produces approx 190 fps with ResNet-18 on 224x224 images, but what does that mean?
The power draw on the NPU is much less though, at 1.5 W peak, so as well as being more performant it is also cooler and more efficient than the CPU, which draws about 5 W running the same model, and that is the only real comparison I can make.
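For anyone wondering what an int8 conversion actually does: each float weight tensor gets mapped onto 8-bit integers with a scale factor, trading a little precision for much cheaper integer maths on the NPU. A minimal pure-Python sketch of symmetric per-tensor quantisation (illustrative only, not the actual RKNN/NPU toolchain):

```python
# Symmetric per-tensor int8 quantisation: map floats in [-max|w|, +max|w|]
# onto integers in [-127, 127] via a single scale factor.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.91, -0.42, 0.003, -1.27, 0.65]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round trip is close but not exact; that lost precision is one
# reason heavily quantised tiny models lose accuracy.
for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.3f} (err {abs(w - r):.4f})")
```

Real toolchains do this per-channel with calibration data, but the basic trade-off is the same.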

I am thinking Georgi Gerganov has already taken CPU optimisation to the max, and currently it is: use a GPU, or keep your fingers crossed for an int8 NPU version.

This looks interesting, as it has been converted to tflite but also looks like it has been separated into distinct functions.

It’s tflite and, by the looks of it, much of it is already converted to int8, so it would likely run on a Coral accelerator, or you could even partition the model between CPU and NPU.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

PS has anyone tried the bigger models of Whisper? For some reason it is not mentioned, but you can have crazy SNR levels and the medium or large models, compared to others, get the transcription correct as if by magic.

I have been playing with DeepFilterNet, which is great but not much good as a front-end for Whisper, as it actually makes results worse.
If you have a model that you can train yourself, you would likely have to preprocess the dataset with DeepFilterNet, and then it would quite likely do as well with noise. OpenAI have kept the training of Whisper to themselves, but you should try the larger models with noise; the results are outstanding.
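If anyone wants to reproduce that kind of noise test, mixing noise into clean speech at a chosen SNR is just a matter of scaling the noise so the power ratio hits the target. A rough stdlib sketch (the 440 Hz tone and the cheap LCG pseudo-noise are stand-ins for real speech and noise recordings):

```python
import math

# SNR_dB = 10*log10(P_signal / P_noise), so to hit a target SNR we
# scale the noise by g = sqrt(P_signal / (P_noise * 10^(SNR/10))).
def mix_at_snr(speech, noise, snr_db):
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Stand-in signals: one second at 16 kHz.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [((t * 1103515245 + 12345) % 65536) / 32768.0 - 1.0
         for t in range(16000)]  # cheap pseudo-noise in [-1, 1)

mixed = mix_at_snr(speech, noise, snr_db=0)  # 0 dB: noise as loud as speech
```

Feeding batches mixed at 0 dB, -5 dB, etc. into the different model sizes is a simple way to see the medium/large models pulling ahead in noise.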