Adding "Whisper" as local STT option?

Hi,
Whisper offers a really great local STT service, so I wondered whether it could be added as an STT option. I haven't taken a deep dive into this idea yet, but I'd be interested in your thoughts. Should this be possible?

Thorsten

Yes, the plugin system should be adaptable to use it. Jarbas probably has a sample version he's about to post…

Thanks @baconator, then I'm excited to hear about @JarbasAl's experiences :slight_smile:

I will have a plugin later, but it needs way too many resources to be a default option; most of the time it won't be usable. I will focus more on a server-side solution for the plugin, so you can move the problem to another device…

first working prototype

https://github.com/OpenVoiceOS/ovos-stt-plugin-whispercpp

And V2 is around the corner, no longer using subprocess calls.

Anyone tried this repo?

High-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model:

Plain C/C++ implementation without dependencies
Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
AVX intrinsics support for x86 architectures
Mixed F16 / F32 precision
Low memory usage (Flash Attention + Flash Forward)
Zero memory allocations at runtime
Runs on the CPU
C-style API
Supported platforms: Linux, Mac OS (Intel and Arm), Windows (MSVC and MinGW), WebAssembly, Raspberry Pi, Android

To be honest, with the current stock availability my default Raspberry Pi platform looks ever more in doubt, but the above is CPU-based. I don't know how much faster it is than the OpenAI source, but it is optimised for CPU. My guess is that the Pi will manage the tiny model at best, which is not that great, as Whisper only excels from the small model up.

I think I will give it a try on my ageing workstation (Intel(R) Xeon(R) CPU E3-1245) and on my new toy, a Radxa Rock 5B (OKdo are going to start stocking it), as alternatives to the Pi are becoming very valid. I want to see which models work before they take longer than realtime. I doubt it will do the medium model, but there are tiny, base and small to choose from.

It also has a nice streaming input and some interesting examples: GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++.

ROCK 5B Rockchip RK3588 ARM Cortex-A76

rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   318.74 ms
whisper_print_timings:      mel time =   123.62 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6228.12 ms / 1038.02 ms per layer
whisper_print_timings:   decode time =   758.88 ms / 126.48 ms per layer
whisper_print_timings:    total time =  7442.09 ms

Xeon(R) CPU E3-1245

./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   221.60 ms
whisper_print_timings:      mel time =    85.55 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  1707.26 ms / 284.54 ms per layer
whisper_print_timings:   decode time =   183.90 ms / 30.65 ms per layer
whisper_print_timings:    total time =  2211.89 ms

Playing some more: setting threads to the maximum of 8 and comparing the tiny, base, small and medium models on the Rock5b (RK3588).

./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.740]   And so my fellow Americans ask not what your country can do for you
[00:00:07.740 --> 00:00:10.740]   ask what you can do for your country


whisper_print_timings:     load time =  1431.40 ms
whisper_print_timings:      mel time =   114.11 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  2746.18 ms / 686.54 ms per layer
whisper_print_timings:   decode time =   353.36 ms / 88.34 ms per layer
whisper_print_timings:    total time =  4663.89 ms
./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   320.30 ms
whisper_print_timings:      mel time =   111.54 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6148.25 ms / 1024.71 ms per layer
whisper_print_timings:   decode time =   580.88 ms / 96.81 ms per layer
whisper_print_timings:    total time =  7173.88 ms
./main -m models/ggml-small.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =   644.22 ms
whisper_print_timings:      mel time =   122.85 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 24924.77 ms / 2077.06 ms per layer
whisper_print_timings:   decode time =  2036.42 ms / 169.70 ms per layer
whisper_print_timings:    total time = 27742.79 ms
./main -m models/ggml-medium.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time = 24878.33 ms
whisper_print_timings:      mel time =   122.06 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 87195.62 ms / 3633.15 ms per layer
whisper_print_timings:   decode time =  4881.50 ms / 203.40 ms per layer
whisper_print_timings:    total time = 117097.61 ms

Running from NVMe also helps, as the runs above were on the SD card I boot from. Model load time drops from 24878.33 ms to 2024.17 ms:

./main -m models/ggml-medium.en.bin -f samples/jfk.wav -t 8

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  2024.17 ms
whisper_print_timings:      mel time =   108.58 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 86100.46 ms / 3587.52 ms per layer
whisper_print_timings:   decode time =  4895.51 ms / 203.98 ms per layer
whisper_print_timings:    total time = 93143.08 ms

It scales really well across threads/cores, especially if you have a monster PC.
That has a lot to do with the author, who seems to be some brilliant scientist: he has created his own CPU-based tensor library for machine learning, optimised for NEON and AVX (GitHub - ggerganov/ggml: Tensor library for machine learning).
I have been shocked by how well it scales. With TensorFlow (DTLN), 2 threads give an improvement but not 2x, and 4 threads seem to make little difference over 2.
It's an amazing repo in how it scales on CPU, but really this is a model you wouldn't want to run on CPU alone, unless you stick to the smaller models.
I have been extremely impressed by the accuracy of the bigger models, as wow F…
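To put numbers on the thread-scaling claim, here is a minimal sketch that runs the binary once per thread count and reports speedup and per-thread efficiency. The `./main` path, model and sample are the ones used in the logs above; the helper names (`build_command`, `total_ms`, `scaling_table`, `run_sweep`) are made up for illustration, and the only thing parsed is the `total time` line.

```python
import re
import subprocess

TOTAL_RE = re.compile(r"total time\s*=\s*([\d.]+) ms")


def build_command(binary, model, wav, threads):
    """Build the whisper.cpp command line used in the runs above."""
    return [binary, "-m", model, "-f", wav, "-t", str(threads)]


def total_ms(output):
    """Extract the 'total time' value (ms) from whisper.cpp output, or None."""
    match = TOTAL_RE.search(output)
    return float(match.group(1)) if match else None


def scaling_table(results):
    """Given {threads: total_ms}, return rows of (threads, ms, speedup, efficiency)."""
    base_threads = min(results)
    base_ms = results[base_threads]
    rows = []
    for threads in sorted(results):
        speedup = base_ms / results[threads]
        efficiency = speedup / (threads / base_threads)
        rows.append((threads, results[threads], speedup, efficiency))
    return rows


def run_sweep(binary="./main", model="models/ggml-base.en.bin",
              wav="samples/jfk.wav", thread_counts=(1, 2, 4, 8)):
    """Run the binary once per thread count and collect the total times."""
    results = {}
    for threads in thread_counts:
        proc = subprocess.run(build_command(binary, model, wav, threads),
                              capture_output=True, text=True, check=True)
        # Check both streams, since it is not assumed here which one carries the timings.
        ms = total_ms(proc.stdout + proc.stderr)
        if ms is not None:
            results[threads] = ms
    return results
```

Something like `for row in scaling_table(run_sweep()): print(row)` prints the table. Feeding it the two Rock5b base.en totals already in this thread, `scaling_table({4: 7442.09, 8: 7173.88})`, gives a speedup of only about 1.04x from 4 to 8 threads, possibly because four of the RK3588's eight cores are the smaller Cortex-A55 ones.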

PS the Dev forwarded me a Pi4 bench

pi@raspberrypi:~/whisper.cpp $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB 
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  1851.33 ms
whisper_print_timings:      mel time =   270.67 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:   decode time =  1287.69 ms / 214.61 ms per layer
whisper_print_timings:    total time = 37281.19 ms

Rock5b

rock@rock-5b:~/nvme/whisper.cpp$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav -t 8
whisper_model_load: loading model from 'models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 505.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 163.43 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   313.91 ms
whisper_print_timings:      mel time =   107.60 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time =  6165.18 ms / 1027.53 ms per layer
whisper_print_timings:   decode time =   657.71 ms / 109.62 ms per layer
whisper_print_timings:    total time =  7256.87 ms

The Rock5b is 5.137 times faster than the Pi4 (7256.87 ms vs 37281.19 ms total), which is interesting. Probably that is down to the Mac optimisation, since the Rock5b also has ARMv8.2 architecture cores?
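For anyone who wants to reproduce that ratio from their own logs, here is a small sketch that pulls the `whisper_print_timings` values out of the pasted output and computes the speedup and a real-time factor. The function names are illustrative only; the log excerpts are copied from the two base.en runs above.

```python
import re

# Matches lines such as "whisper_print_timings:    total time =  7256.87 ms"
TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time\s*=\s*([\d.]+) ms")


def parse_timings(log_text):
    """Return a dict like {"encode": 6165.18, "total": 7256.87} (milliseconds)."""
    return {name: float(ms) for name, ms in TIMING_RE.findall(log_text)}


def realtime_factor(audio_seconds, total_ms):
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / (total_ms / 1000.0)


pi4_log = """
whisper_print_timings:   encode time = 33790.07 ms / 5631.68 ms per layer
whisper_print_timings:    total time = 37281.19 ms
"""
rock5b_log = """
whisper_print_timings:   encode time =  6165.18 ms / 1027.53 ms per layer
whisper_print_timings:    total time =  7256.87 ms
"""

pi4 = parse_timings(pi4_log)
rock5b = parse_timings(rock5b_log)

speedup = pi4["total"] / rock5b["total"]
print(f"Rock5b speedup over Pi4: {speedup:.3f}x")
print(f"Pi4 real-time factor:    {realtime_factor(11.0, pi4['total']):.2f}")
print(f"Rock5b real-time factor: {realtime_factor(11.0, rock5b['total']):.2f}")
```

A real-time factor below 1 means the 11-second clip took longer than 11 seconds to transcribe, which is the case for the Pi4 run here.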

that repo is what the plugin i linked above uses internally :slight_smile:

I never noticed, Jarbas. I presumed you were using the original OpenAI version with GPU support, as I never read the thread. Even though what Georgi has done is amazing, you still need a monster CPU to run it.

Whisper is absolutely amazing, but the accuracy (WER) drops off substantially on the tiny and base (multilingual) models, and it doesn't make a lot of sense to use those, since for that load/WER there are probably older and better solutions. The small (multilingual) model starts to get near the great WER figures Whisper produces, but the size and load grow almost exponentially.

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x
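As a rough way to read that table, the sketch below picks the largest model whose "Required VRAM" figure fits a given budget. The numbers are copied from the table and are only approximate; the function name `largest_model_for` is made up here, and real usage would want some headroom on top.

```python
# (name, parameters in millions, approximate required VRAM in GB), from the table above.
MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def largest_model_for(vram_gb):
    """Return the largest model whose approximate VRAM need fits, or None."""
    fitting = [name for name, _params, need_gb in MODELS if need_gb <= vram_gb]
    return fitting[-1] if fitting else None


for budget in (1, 2, 4, 8, 12):
    print(f"{budget:>2} GB -> {largest_model_for(budget)}")
```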

I haven't tried the original on GPU. Also, the streaming CPU mode seems to send things crazy again, with the load ballooning even more.
Loading a full WAV on a MacBook M1 Pro returned almost 6x realtime with medium.en, which I presume is about as accurate as the large multilingual model. I presume it is a similar story with the streaming input, though I don't know why it causes so much more load. I would hazard a guess that streaming on a MacBook M1 Pro might only manage the base.en model.

It really is amazing, but wow, it's a total monster of a model. I am presuming each multilingual model is similar in accuracy to the next smaller English-only model, but I have never tried multilingual.

Yeah, this will really need a beefy machine for multilingual, but I'm hoping the tiny.en model will be usable on a Pi4…

My intended usage is to use this plugin with ovos TTS Server - Documentation, so when I am done with the Dockerfile I can have a large model running on a beefy setup, with the Pis just sending HTTP requests.
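The "Pis just sending HTTP requests" part could look roughly like the sketch below, using only the Python standard library. This is not the actual OVOS server API: the `/stt` path, the `lang` query parameter and the plain-text response are hypothetical placeholders, so check the server documentation for the real endpoint.

```python
import urllib.request


def build_stt_request(server_url, wav_bytes, lang="en"):
    """Build a POST request carrying raw WAV audio to a remote STT server.

    The "/stt" path and the "lang" query parameter are hypothetical.
    """
    url = f"{server_url.rstrip('/')}/stt?lang={lang}"
    return urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def transcribe_remote(server_url, wav_path, lang="en", timeout=30):
    """Send a WAV file to the server and return the response body as text."""
    with open(wav_path, "rb") as wav_file:
        request = build_stt_request(server_url, wav_file.read(), lang)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

On the Pi side this keeps the footprint tiny: record the utterance, call `transcribe_remote("http://my-server:8080", "utterance.wav")` (host and port made up), and leave the heavy model on the other machine.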

Not sure if it makes sense, as the WER drops off a cliff for the tiny and base models (supposedly, according to another reviewer). For a larger model, yes, but I don't know about running those on CPU. As for running on GPU: after some time screaming at my computer trying to install CUDA 11.6 on Ubuntu 22.04, my advice is to use one of the Nvidia Docker containers instead, as I gave up!
But install the right torch first:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
Then install Whisper:
pip install git+https://github.com/openai/whisper.git
Test clip: https://commons.wikimedia.org/wiki/File:Reagan_Space_Shuttle_Challenger_Speech.ogv (4m48s)
time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model medium.en --threads=8

real 0m42.072s
user 0m46.303s
sys 0m3.591s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model small.en --threads=8

real 0m22.323s
user 0m24.127s
sys 0m2.545s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model base.en --threads=8

real 0m13.119s
user 0m14.324s
sys 0m2.137s

time whisper Reagan_Space_Shuttle_Challenger_Speech.ogv --best_of None --beam_size None --model tiny.en --threads=8

real 0m10.855s
user 0m11.907s
sys 0m2.106s
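To compare these GPU runs with the CPU numbers earlier in the thread, the wall-clock `real` values can be turned into real-time factors for the 4m48s clip. A small sketch, with the timings copied from the four runs above:

```python
def parse_real(value):
    """Convert a `time` value such as '0m42.072s' into seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


AUDIO_SECONDS = 4 * 60 + 48  # length of the test clip

runs = {
    "medium.en": "0m42.072s",
    "small.en": "0m22.323s",
    "base.en": "0m13.119s",
    "tiny.en": "0m10.855s",
}

factors = {model: AUDIO_SECONDS / parse_real(real) for model, real in runs.items()}
for model, factor in factors.items():
    print(f"{model:10s} {factor:5.1f}x realtime")
```

That works out to roughly 6.8x realtime for medium.en and about 26.5x for tiny.en, and those wall-clock figures include whatever the command spends loading the model.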

So even though that is only an RTX 3050 desktop GPU (a tad slower than a GTX 1070, from memory), it still beats the pants off running on CPU, unless the CPU is something pretty awesome.
When testing with a GPU, run twice: loading the model into VRAM accounts for much of the first run, and the second run is far faster. I think the difference is purely the model load.
For Mac users on Arm, GitHub - ggerganov/whisper.cpp: Port of OpenAI's Whisper model in C/C++ is amazing, but maybe less so if you have a GPU… I have just been reading that the Metal framework has been available in PyTorch since June.
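One way to check the "run twice" point on GPU is to time the model load separately from back-to-back transcriptions inside a single Python process. The sketch below assumes the openai/whisper package installed as described earlier; `whisper.load_model` and `model.transcribe` are that package's calls, while `timed`, `benchmark` and `benchmark_whisper` are just illustrative helper names.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark(load_fn, run_fn, runs=2):
    """Time one model load, then `runs` inferences on the already loaded model."""
    model, load_seconds = timed(load_fn)
    run_seconds = [timed(run_fn, model)[1] for _ in range(runs)]
    return load_seconds, run_seconds


def benchmark_whisper(model_name="medium.en",
                      audio_path="Reagan_Space_Shuttle_Challenger_Speech.ogv"):
    """Apply the benchmark to the openai/whisper Python package (must be installed)."""
    import whisper  # imported lazily so the helpers above work without it

    return benchmark(
        lambda: whisper.load_model(model_name),
        lambda model: model.transcribe(audio_path),
    )
```

Calling `load_s, run_s = benchmark_whisper()` and printing both would show how much of a first run is load time and whether the second transcription is really faster than the first.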

PS: I came across the 'review' when I was just googling and knew nothing about Whisper; it is just coincidence that we all seem to have found it. Thinking about it, I have no idea whether the review was correct, as it did seem a bit critical. It said Whisper is very good but occasionally gets things totally wrong, and that the published WER was optimistic. It could have been just the multilingual tiny and base models it was critical of. That stuck in my head, but it could be overly critical.

Just to go on a bit about WER/load: https://arxiv.org/pdf/2005.03191.pdf

ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets

That is likely a much better fit for a Pi4, with 10M parameters being about a quarter of the Whisper tiny model (39M), which very likely translates directly into inference speed.
I have always liked TensorFlowASR (GitHub - TensorSpeech/TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2, with supported languages that can use characters or subwords). Maybe that is just my preference over PyTorch, as you can do the same things with PyTorch, but with TFLite I have a reasonable knowledge of how easy it is to use a Coral delegate, a Mali delegate or whatever, or to partition a model so that it runs across several of CPU/GPU/NPU simultaneously, which is why I have the RK3588.
The same goes for TTS with TensorFlowTTS (GitHub - TensorSpeech/TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2, supporting English, French, Korean, Chinese and German, and easy to adapt to other languages), as conversion to TFLite and support for embedded devices and accelerators seem much better. Or at least they were; because I have been dodging PyTorch, I am lacking knowledge there.

Arm do an interesting tutorial with ArmNN that gives pretty awful results, but I am probably going to play with it to see if I can get the Mali delegate working.

https://developer.arm.com/documentation/102603/2108/Device-specific-installation/Install-on-Raspberry-Pi

 rock@rock-5b:~/workspace/armnn/python/pyarmnn/examples/speech_recognition$ time python3 run_audio_file.py --audio_file_path tests/testdata/quick_brown_fox_16000khz.wav --model_file_path tflite_int8/wav2letter_int8.tflite  --preferred_backends CpuAcc CpuRef
Your ArmNN library instance does not support Onnx models parser functionality.  Skipped IOnnxParser import.
Preferred backends: ['CpuAcc', 'CpuRef']
IDeviceSpec { supportedBackends: [CpuAcc, CpuRef]}
Optimization warnings: ()
Processing Audio Frames...
the quick brown fox juhmpe over the llazy dag

real    0m2.693s
user    0m8.031s
sys     0m0.282s

Next, just to see if I can switch to the GPU, the Mali G610 MP4, which is supposed to have pretty good ML performance. This is the example script:

# Copyright © 2021 Arm Ltd and Contributors. All rights reserved.
# SPDX-License-Identifier: MIT

"""Automatic speech recognition with PyArmNN demo for processing audio clips to text."""

import sys
import os
import numpy as np

script_dir = os.path.dirname(__file__)
sys.path.insert(1, os.path.join(script_dir, '..', 'common'))

from argparse import ArgumentParser
from network_executor import ArmnnNetworkExecutor
from utils import prepare_input_data
from audio_capture import AudioCaptureParams, capture_audio
from audio_utils import decode_text, display_text
from wav2letter_mfcc import Wav2LetterMFCC, W2LAudioPreprocessor
from mfcc import MFCCParams

# Model Specific Labels
labels = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g', 7: 'h', 8: 'i', 9: 'j', 10: 'k', 11: 'l', 12: 'm',
          13: 'n',
          14: 'o', 15: 'p', 16: 'q', 17: 'r', 18: 's', 19: 't', 20: 'u', 21: 'v', 22: 'w', 23: 'x', 24: 'y',
          25: 'z',
          26: "'", 27: ' ', 28: '$'}


def parse_args():
    parser = ArgumentParser(description="ASR with PyArmNN")
    parser.add_argument(
        "--audio_file_path",
        required=True,
        type=str,
        help="Path to the audio file to perform ASR",
    )
    parser.add_argument(
        "--model_file_path",
        required=True,
        type=str,
        help="Path to ASR model to use",
    )
    parser.add_argument(
        "--preferred_backends",
        type=str,
        nargs="+",
        default=["CpuAcc", "CpuRef"],
        help="""List of backends in order of preference for optimizing
        subgraphs, falling back to the next backend in the list on unsupported
        layers. Defaults to [CpuAcc, CpuRef]""",
    )
    return parser.parse_args()


def main(args):
    # Read command line args
    audio_file = args.audio_file_path

    # Create the ArmNN inference runner
    network = ArmnnNetworkExecutor(args.model_file_path, args.preferred_backends)

    # Specify model specific audio data requirements
    audio_capture_params = AudioCaptureParams(dtype=np.float32, overlap=31712, min_samples=47712, sampling_freq=16000,
                                              mono=True)

    buffer = capture_audio(audio_file, audio_capture_params)

    # Extract features and create the preprocessor

    mfcc_params = MFCCParams(sampling_freq=16000, num_fbank_bins=128, mel_lo_freq=0, mel_hi_freq=8000,
                             num_mfcc_feats=13, frame_len=512, use_htk_method=False, n_fft=512)

    wmfcc = Wav2LetterMFCC(mfcc_params)
    preprocessor = W2LAudioPreprocessor(wmfcc, model_input_size=296, stride=160)
    current_r_context = ""
    is_first_window = True

    print("Processing Audio Frames...")
    for audio_data in buffer:
        # Prepare the input Tensors
        input_data = prepare_input_data(audio_data, network.get_data_type(), network.get_input_quantization_scale(0),
                                        network.get_input_quantization_offset(0), preprocessor)

        # Run inference
        output_result = network.run([input_data])

        # Slice and Decode the text, and store the right context
        current_r_context, text = decode_text(is_first_window, labels, output_result)

        is_first_window = False

        display_text(text)

    print(current_r_context, flush=True)


if __name__ == "__main__":
    args = parse_args()
    main(args)
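In the script above, the only knob for trying the GPU is `--preferred_backends`. The sketch below recreates just that argument to show how an ordered list with `GpuAcc` first is parsed; `GpuAcc` is ArmNN's GPU backend name, but whether it actually works on the Mali G610 with the current drivers is not something this thread confirms.

```python
from argparse import ArgumentParser


def build_parser():
    """Recreate only the backend option from the script above."""
    parser = ArgumentParser(description="ASR with PyArmNN (backend option only)")
    parser.add_argument(
        "--preferred_backends",
        type=str,
        nargs="+",
        default=["CpuAcc", "CpuRef"],
    )
    return parser


# GPU backend first; per the script's help text, unsupported layers
# fall back to the next backend in the list.
gpu_first = build_parser().parse_args(
    ["--preferred_backends", "GpuAcc", "CpuAcc", "CpuRef"]
)
print(gpu_first.preferred_backends)

# With no flag given, the script's CPU-only default applies.
defaults = build_parser().parse_args([])
print(defaults.preferred_backends)
```

So the earlier command line would only need `--preferred_backends GpuAcc CpuAcc CpuRef` in place of `--preferred_backends CpuAcc CpuRef` to request the GPU.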

Wav2Letter seems to be exactly what the name says and lacks a context dictionary, but it was ArmNN that was of interest, as unlike the Pi many Arm boards now have quite capable GPUs and NPUs.
The 8 GB Rock5b at $149.00 might sound expensive compared to a Pi4, but on the CPU it ran Whisper 5x faster, and it also has the most powerful Mali-based GPU I have seen plus a (supposedly) 6 TOPS NPU, although TOPS is not a good metric.

These boards are never going to compete with the latest and greatest GPUs, but they can partition models and use system RAM, whereas with a dedicated GPU you may want to allocate it to a single model already loaded in VRAM.
Either way, a server-based system shared across clients (satellites) is a far superior infrastructure: the distribution of commands is inherently client-server, the heavy ASR and TTS load sits idle the majority of the time, and each use is short enough that queued clashes are rare.

The Rock5b is still only shipping to early adopters, but apparently OKdo will be stocking them. The carrot was $50 off, as the distro images are still extremely raw.

The latest OpenVoiceOS images ship by default with libwhispercpp installed, together with the STT plugin @JarbasAl linked above.

It will be interesting to see how the development goes, as it seems to be moving at a rapid pace.

Here are some benchmarks and tests similar to what @StuartIanNaylor posted, with WhisperCPP cross-compiled within the whole buildroot system. I might redo them later with libwhispercpp compiled with the OpenBLAS option.

Benchmark - tiny model (as that is the default of the STT plugin)

mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m ggml-tiny.bin -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

whisper_print_timings:     load time =  1693.92 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 20830.68 ms / 5207.67 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 22524.77 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m ggml-tiny.bin -f jfk.wav -t 4
whisper_model_load: loading model from 'ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   930.82 ms
whisper_print_timings:      mel time =   325.40 ms
whisper_print_timings:   sample time =    34.11 ms
whisper_print_timings:   encode time = 21394.58 ms / 5348.65 ms per layer
whisper_print_timings:   decode time =  1241.69 ms / 310.42 ms per layer
whisper_print_timings:    total time = 23929.93 ms

Strangely, I am sat in front of a mic retrying the streaming mode. I never checked out the streaming branch before, which is probably why my results were bad.
I haven't got a Pi4 anymore, so I am running on a Rock5b (RK3588).
Streaming has had an update and seems really good, but I am not sure streaming mode is worthwhile for short command sentences when latency is so short and non-streaming accuracy is really, really good.

I will post a benchmark here again as a check of the non-streaming performance; with streaming mode you can only really check load, as feeding a live stream is a bit hard to measure.
The old rules still apply: you need a good signal from your mic in terms of volume, which is often pretty poor without enabling an AGC and setting the volume to max.
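A quick way to see whether a recording is too quiet before blaming the model is to check its peak level. This is a minimal sketch, not something from the thread: it assumes a 16-bit PCM WAV file and reports the peak in dB relative to full scale (0 dBFS is the loudest possible sample).

```python
import array
import math
import sys
import wave


def peak_dbfs(samples):
    """Peak level of 16-bit PCM samples in dB relative to full scale."""
    peak = max((abs(s) for s in samples), default=0)
    return float("-inf") if peak == 0 else 20 * math.log10(peak / 32768)


def wav_peak_dbfs(path):
    """Read a 16-bit PCM WAV file and return its peak level in dBFS."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return peak_dbfs(samples)
```

A capture whose peak sits far below 0 dBFS is using only a small part of the available range, which is the situation where an AGC or more gain would likely help.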

I can show you the load, but I have no Pi4 to try it on.

No BLAS

./main -m models/ggml-tiny.en.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 8 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.540]   And so my fellow Americans ask not what your country can do for you
[00:00:07.540 --> 00:00:10.160]   ask what you can do for your country.
[00:00:10.160 --> 00:00:30.000]   You can do for your country


whisper_print_timings:     load time =   305.88 ms
whisper_print_timings:      mel time =   134.55 ms
whisper_print_timings:   sample time =    11.85 ms
whisper_print_timings:   encode time =   802.06 ms / 200.51 ms per layer
whisper_print_timings:   decode time =   321.21 ms / 80.30 ms per layer
whisper_print_timings:    total time =  1576.53 ms

Benchmark - base model

mycroft@OpenVoiceOS-e3830c:~ $ ./bench -m models/ggml-base.bin -t 4
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

whisper_print_timings:     load time =  2274.95 ms
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms
whisper_print_timings:   encode time = 56033.60 ms / 9338.93 ms per layer
whisper_print_timings:   decode time =     0.00 ms / 0.00 ms per layer
whisper_print_timings:    total time = 58308.86 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler
mycroft@OpenVoiceOS-e3830c:~ $ ./main -m models/ggml-base.bin -f samples/jfk.wav -t 4
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.


whisper_print_timings:     load time =  1338.00 ms
whisper_print_timings:      mel time =   287.82 ms
whisper_print_timings:   sample time =    41.32 ms
whisper_print_timings:   encode time = 55883.51 ms / 9313.92 ms per layer
whisper_print_timings:   decode time =  3145.44 ms / 524.24 ms per layer
whisper_print_timings:    total time = 60706.79 ms

The tiny and base are the only two models that can be tested on my RPI4 with 2 GB of memory. The small model already runs out of memory.
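
In case anyone wants to compare runs without eyeballing the logs, here is a minimal sketch for pulling the `whisper_print_timings` and `mem_required` values out of output like the above. The name `parse_whisper_log` is made up for illustration, and the regexes assume the log format stays exactly as printed in this thread.

```python
import re

# Matches e.g. "whisper_print_timings:   encode time = 55883.51 ms / 9313.92 ms per layer"
TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time =\s*([\d.]+) ms")
# Matches e.g. "whisper_model_load: mem_required  = 506.00 MB"
MEM_RE = re.compile(r"whisper_model_load: mem_required\s+=\s*([\d.]+) MB")


def parse_whisper_log(text):
    """Return timings in ms (keyed by stage) plus mem_required in MB, if present."""
    result = {m.group(1): float(m.group(2)) for m in TIMING_RE.finditer(text)}
    mem = MEM_RE.search(text)
    if mem:
        result["mem_required_mb"] = float(mem.group(1))
    return result


# A few lines copied from the base model run above.
sample = """\
whisper_model_load: mem_required  = 506.00 MB
whisper_print_timings:     load time =  1338.00 ms
whisper_print_timings:   encode time = 55883.51 ms / 9313.92 ms per layer
whisper_print_timings:    total time = 60706.79 ms
"""
stats = parse_whisper_log(sample)
print(stats)
```

Only the first number on each timing line is kept, so the per-layer figure is ignored.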

?

Did you switch to the stream branch? It seems to run much faster.
PS: I couldn’t work out how to enable BLAS, but from the issues it seems to make little difference.

Also, on a Pi it’s probably better to try the tiny.en model first.

git fetch --all
git checkout stream
git reset --hard origin/stream

make clean
make stream

./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000

Even in non-stream mode, that branch seems almost 3x faster:
make tiny.en

I reserve 1/3 of the memory for zram compressed swap within the OpenVoiceOS system. Probably the same approach as what you linked.

Anyhow, running it quickly fills up the memory and swap, and then it kills itself.

Combined benchmarks table

mycroft@OpenVoiceOS-e3830c:~ $ ./bench-all.sh 4
Usage: ./bench.sh [n_threads]

Running benchmark for the tiny and base models
This can take a while!

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | tiny | 4 | 835.68 | 21509.08 |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | base | 4 | 1183.47 | 55256.20 |


mycroft@OpenVoiceOS-e3830c:~ $ ./bench-all.sh 1
Usage: ./bench.sh [n_threads]

Running benchmark for the tiny and base models
This can take a while!

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | tiny | 1 | 861.34 | 29428.21 |
| Raspberry Pi 4 | OpenVoiceOS |  NEON | base | 1 | 1146.02 | 87615.00 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu20.04 | NEON | tiny.en | 4 | 243.54 | 779.49 |
| RK3588 | Ubuntu20.04 | NEON | base.en | 4 | 316.52 | 1821.06 |
| RK3588 | Ubuntu20.04 | NEON | small.en | 4 | 618.93 | 7117.69 |
| RK3588 | Ubuntu20.04 | NEON | medium.en | 4 | 1514.88 | 24139.92 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu20.04 | NEON | tiny | 4 | 233.86 | 791.01 |
| RK3588 | Ubuntu20.04 | NEON | base | 4 | 297.93 | 1813.69 |
| RK3588 | Ubuntu20.04 | NEON | small | 4 | 592.18 | 7102.28 |
| RK3588 | Ubuntu20.04 | NEON | medium | 4 | 1587.36 | 24147.87 |

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu20.04 | NEON | tiny | 8 | 226.48 | 740.34 |
| RK3588 | Ubuntu20.04 | NEON | base | 8 | 300.48 | 1723.42 |
| RK3588 | Ubuntu20.04 | NEON | small | 8 | 620.58 | 6392.47 |
| RK3588 | Ubuntu20.04 | NEON | medium | 8 | 1533.75 | 21899.08 |

I still haven’t worked out the little (0-3) / big (4-7) core layout on this thing. If I pin to the big cores with taskset -c 4-7 I get:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu20.04 | NEON | tiny.en | 4 | 234.14 | 681.53 |
| RK3588 | Ubuntu20.04 | NEON | base.en | 4 | 297.08 | 1679.75 |
| RK3588 | Ubuntu20.04 | NEON | small.en | 4 | 599.98 | 6867.66 |
| RK3588 | Ubuntu20.04 | NEON | medium.en | 4 | 1492.73 | 23600.45 |
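
The `taskset -c 4-7` pinning can also be done from inside Python, which might be handy for the STT plugin. A minimal sketch, assuming Linux and assuming cores 4-7 really are the big cores as described above; `pick_cores` and `pin_to_cores` are hypothetical helpers, not something that exists in the plugin.

```python
import os


def pick_cores(wanted, available):
    """Keep only the wanted cores that exist; fall back to all available ones."""
    chosen = set(wanted) & set(available)
    return chosen or set(available)


def pin_to_cores(wanted):
    """Restrict the current process to the wanted cores (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform, leave scheduling alone
    cores = pick_cores(wanted, os.sched_getaffinity(0))
    os.sched_setaffinity(0, cores)
    return cores


# Assumption: 4-7 are the big cores, the same set as `taskset -c 4-7`.
BIG_CORES = range(4, 8)
print(pin_to_cores(BIG_CORES))
```

On a machine without cores 4-7 the sketch leaves the process on whatever cores it already had, so it should be harmless to try elsewhere.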

I tried to compile with OpenBLAS, but it seemed to kill the make.


These are from the master branch, as I didn’t think about which branch I was on after trying stream:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] |
| --- | -- | ------ | ----- | ------- | --------- | ----------- |
| RK3588 | Ubuntu20.04 | NEON | tiny | 8 | 226.48 | 2681.05 |
| RK3588 | Ubuntu20.04 | NEON | base | 8 | 283.56 | 6132.44 |
| RK3588 | Ubuntu20.04 | NEON | small | 8 | 583.39 | 24397.78 |
| RK3588 | Ubuntu20.04 | NEON | medium | 8 | 1490.98 | 85099.45 |

I think all the stream branch does is cut the ‘ctc’ width from 30 sec to 10 sec, so on a short command sentence it’s much faster, while it is also slightly faster on longer transcriptions.
Also, the overall timing includes the load time, but that only happens on the first load, so if the model stays resident in memory you can discount it.
I would give tiny.en a go in stream mode.
I probably should have loaded from eMMC or NVMe, but SD is like for like.
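
To put a number on the difference between the two branches, here is a quick sketch that divides the master encode times by the stream ones, using the 8-thread RK3588 figures from the tables above. It assumes the earlier 8-thread table was taken on the stream branch; the variable names are just for illustration.

```python
# Encode times in ms, 8 threads on the RK3588, copied from the tables above.
# Assumption: the first set comes from the stream branch, the second from master.
stream_ms = {"tiny": 740.34, "base": 1723.42, "small": 6392.47, "medium": 21899.08}
master_ms = {"tiny": 2681.05, "base": 6132.44, "small": 24397.78, "medium": 85099.45}

# How many times longer master takes than stream for the same model.
speedup = {model: master_ms[model] / stream_ms[model] for model in stream_ms}

for model, factor in speedup.items():
    print(f"{model:>6}: stream is {factor:.1f}x faster than master")
```

For these numbers every model lands somewhere between 3.5x and 3.9x.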

As it all runs on the CPU, having more cycles available obviously influences the results. The table below presents the different results a bit better.

The rows marked “With OVOS services running” represent performance when run with the STT plugin (although the Python bindings and library approach might still alter the numbers a bit; not sure whether for better or for worse).

The rows marked “Without OVOS services running” are more for comparison with the other data out there.

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
| --- | -- | ------ | ----- | ------- | --------- | ----------- | ------- |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |

As you can see, this makes a difference, especially when using all 4 threads. Not so much with 1 thread, as the other 3 are still available for the whole OVOS software stack.
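
A rough sketch of how much going from 1 to 4 threads actually buys in each case, using the encode times from the Raspberry Pi 4 table above. The dictionary layout is just something made up for this example.

```python
# Encode times in ms from the Raspberry Pi 4 table above:
# (model, services) -> {threads: encode ms}
encode_ms = {
    ("tiny", "with OVOS"): {1: 29428.21, 4: 21509.08},
    ("base", "with OVOS"): {1: 87615.00, 4: 55256.20},
    ("tiny", "without OVOS"): {1: 24018.10, 4: 10122.80},
    ("base", "without OVOS"): {1: 71587.61, 4: 24814.62},
}

# Ratio of the 1-thread time to the 4-thread time for each row pair.
scaling = {key: times[1] / times[4] for key, times in encode_ms.items()}

for (model, services), factor in scaling.items():
    print(f"{model} {services}: 4 threads are {factor:.2f}x faster than 1")
```

With the services running, four threads give well under 2x; without them it is over 2x for both models.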


It’s a shame really, as Whisper steps up a notch on the small model. ‘Turn on the light’ is perfect for me with the tiny model on the faster stream branch.
Other phrases, though, for me:
tiny.en → [00:00:00.000 --> 00:00:05.760] setting a loan for Monday at 9 o’clock [832.27 ms]
base.en → [00:00:00.000 --> 00:00:05.800] setting alarm for Monday at 9 o’clock. [2238.11 ms]
small.en → [00:00:00.000 --> 00:00:05.800] Set an alarm for Monday at 9 o’clock [6852.66 ms]

small.en steps up a notch and gets it perfect, but takes about 1.5x realtime on the same 4.4 sec sample.
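
For anyone wanting to check the realtime figure, a small sketch that divides each processing time by the 4.4 sec sample length. `realtime_factor` is just a helper name picked for this example.

```python
SAMPLE_SECONDS = 4.4  # length of the test sample quoted above

# Processing time in ms per model, from the three results above.
elapsed_ms = {"tiny.en": 832.27, "base.en": 2238.11, "small.en": 6852.66}


def realtime_factor(ms, audio_seconds):
    """Processing time divided by audio length; below 1.0 is faster than realtime."""
    return (ms / 1000.0) / audio_seconds


rtf = {model: realtime_factor(ms, SAMPLE_SECONDS) for model, ms in elapsed_ms.items()}
for model, factor in rtf.items():
    print(f"{model}: {factor:.2f}x realtime")
```

tiny.en and base.en come in under 1.0 here, so only small.en falls behind realtime on this sample.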