Adding "Whisper" as local STT option?

The tiny and base models do fit on my 2GB RPi 4; the small model really needs the 4GB as a minimum.

Going to test the tiny and base models with @JarbasAl's STT plugin in the upcoming days.

Try the stream repo: whatever it does, even with a normal file feed it is about 3x faster and works just the same on short command sentences.
I think the Pi 4 falls a bit short on performance, and with Whisper a bit short is actually a long way.

Whisper is just a transformer model from OpenAI, and more will likely follow. Maybe someone will split the model so that SoCs with AI accelerators or GPUs can partition it and share the load with the CPU.
A lot of the NPUs are int8 only, and that only really supports fairly simple models.
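For anyone wondering what "int8 only" costs in practice, here is a minimal sketch of symmetric per-tensor quantisation in plain Python. The function names and the example weights are made up for illustration, and real NPU toolchains use their own (often per-channel, calibrated) schemes, so treat this only as the basic idea.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the rounding visible.
weights = [0.02, -0.75, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, worst_error)
```

Every weight ends up at most half a step (`scale / 2`) away from its original value, and that rounding noise adds up through the layers of a larger model.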

Did you compare the streaming version with the latest commit on the normal one?

Isn't the speed coming from the 2x speed commit that was recently pushed?

Yeah, I am thinking my RK3588 is a tad short too, as the minimum model you really need to run is Small: that is where it gets near to some of what OpenAI claim. In fact it is probably still quite a bit off, and that is with 3x+ what you will get on a Pi 4.
The base and tiny models start to fall off a cliff on WER. Yes, the Pi 4 will just about run them, but it is not particularly good at running transformer models, even with very heavily optimised code and quantised models; it is more a celebration that it runs than a working ASR.
The Whisper model is a translation model based on a 30-second window and is fairly far from optimised for short smart-AI command sentences, even though Georgi Gerganov has done some god-level coding optimisation and model hacking to get where we are.
It also has some English-only models hacked in, and as examples they produced some vastly smaller models, but the published WER and acclaim are for the large model, which scales quite well down to the small and then goes into a bit of a nosedive.
If the training methods of Whisper were also open source it would likely be a different story, and Linux would have its first commercially comparable ASR.
Part of it is the way Whisper works, which relies far less on phonetic correctness and more on what has the highest matrix score over a 30-second window; it gets longer sentences far more correct than short ones because they contain so much more sentence logic.
Get a mic and give it a try with typical command sentences rather than reading book-like narrative.
The shorter the sentence and the smaller the model the worse it gets.
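To put a number on how little of the 30-second window a voice command fills, here is a small sketch. The constants are Whisper's usual front-end settings (16 kHz input, 10 ms hop, encoder halving the frame count); the function name is just for illustration. The 1500 it produces is the same value whisper.cpp reports as `n_audio_ctx` in the logs further down this thread.

```python
SAMPLE_RATE = 16000    # Hz, the rate Whisper expects
WINDOW_SECONDS = 30    # fixed window length
HOP_SAMPLES = 160      # 10 ms hop between mel frames
ENCODER_STRIDE = 2     # the encoder's convolution halves the frame count


def window_stats(utterance_seconds):
    """Return window sizes and the share of the window holding actual speech."""
    window_samples = SAMPLE_RATE * WINDOW_SECONDS
    mel_frames = window_samples // HOP_SAMPLES
    encoder_positions = mel_frames // ENCODER_STRIDE
    speech_fraction = min(utterance_seconds, WINDOW_SECONDS) / WINDOW_SECONDS
    return {
        "window_samples": window_samples,
        "mel_frames": mel_frames,
        "encoder_positions": encoder_positions,
        "speech_fraction": speech_fraction,
    }


# A typical two-second command such as "turn on the kitchen light".
print(window_stats(2.0))
```

A two-second command fills under 7% of the window, which goes some way to explaining why the model has so little sentence context to work with on short commands.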

Did some more testing with OpenBLAS. On AArch64 this again brings some performance gains. For completeness' sake, the full table:

| CPU | OS | Config | Model | Threads | Load [ms] | Encode [ms] | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 861.34 | 29428.21 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 843.80 | 16145.62 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 835.68 | 21509.08 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 824.24 | 13187.96 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 1146.02 | 87615.00 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 1103.39 | 52228.30 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 1183.47 | 55256.20 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 1161.32 | 29851.40 | With OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 1 | 752.64 | 24018.10 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 1 | 751.96 | 13082.95 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | tiny | 4 | 743.37 | 10122.80 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | tiny | 4 | 742.90 | 9564.89 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 1 | 974.46 | 71587.61 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 1 | 979.65 | 43852.07 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON | base | 4 | 982.24 | 24814.62 | Without OVOS services running |
| Raspberry Pi 4 - 2GB | OpenVoiceOS | NEON BLAS | base | 4 | 982.80 | 19910.19 | Without OVOS services running |
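To make the BLAS gain easier to read off, a quick sketch that divides the NEON encode time by the NEON BLAS encode time for each pairing. The numbers are copied from the table; the dictionary layout and function name are only for illustration.

```python
# (model, threads, ovos_services_running) -> (NEON encode ms, NEON BLAS encode ms)
ENCODE_MS = {
    ("tiny", 1, True): (29428.21, 16145.62),
    ("tiny", 4, True): (21509.08, 13187.96),
    ("base", 1, True): (87615.00, 52228.30),
    ("base", 4, True): (55256.20, 29851.40),
    ("tiny", 1, False): (24018.10, 13082.95),
    ("tiny", 4, False): (10122.80, 9564.89),
    ("base", 1, False): (71587.61, 43852.07),
    ("base", 4, False): (24814.62, 19910.19),
}


def blas_speedup(model, threads, ovos_running):
    """How many times faster the encode step is with OpenBLAS enabled."""
    neon_ms, blas_ms = ENCODE_MS[(model, threads, ovos_running)]
    return neon_ms / blas_ms


for key in ENCODE_MS:
    print(key, round(blas_speedup(*key), 2))
```

The gain is largest on a single thread and shrinks to a few percent for tiny on 4 threads with the OVOS services stopped.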

With the tiny model, close to 1x real time;

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-tiny.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 390.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   741.98 ms
whisper_print_timings:      mel time =   200.68 ms
whisper_print_timings:   sample time =    22.64 ms
whisper_print_timings:   encode time =  9588.57 ms / 2397.14 ms per layer
whisper_print_timings:   decode time =   716.64 ms / 179.16 ms per layer
whisper_print_timings:    total time = 11276.80 ms

With the base model it is twice as slow, however the transcription is a little bit better as it picked up a comma :smiley: (could be by accident)

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-base.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-base.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.600]   And so my fellow Americans ask not what your country can do for you,
[00:00:07.600 --> 00:00:10.600]   ask what you can do for your country.


whisper_print_timings:     load time =   978.30 ms
whisper_print_timings:      mel time =   198.97 ms
whisper_print_timings:   sample time =    28.10 ms
whisper_print_timings:   encode time = 20292.19 ms / 3382.03 ms per layer
whisper_print_timings:   decode time =  1504.07 ms / 250.68 ms per layer
whisper_print_timings:    total time = 23004.49 ms

Just about managed to get the small model into memory, so here it goes;

mycroft@OpenVoiceOS-e3830c:~/whispercpp $ ./main -m models/ggml-small.bin -f samples/jfk.wav
whisper_model_load: loading model from 'models/ggml-small.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 4 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =  4153.16 ms
whisper_print_timings:      mel time =   207.59 ms
whisper_print_timings:   sample time =    32.41 ms
whisper_print_timings:   encode time = 80229.52 ms / 6685.79 ms per layer
whisper_print_timings:   decode time =  3845.80 ms / 320.48 ms per layer
whisper_print_timings:    total time = 88472.29 ms
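For reference, the three runs above expressed as a real-time factor (total processing time divided by audio length, so 1.0 means exactly real time). The totals and the 11.0 second sample length are taken from the logs; the helper name is only illustrative.

```python
AUDIO_SECONDS = 11.0  # length of samples/jfk.wav as reported by whisper.cpp

# Total time in ms from whisper_print_timings for each model.
TOTAL_MS = {"tiny": 11276.80, "base": 23004.49, "small": 88472.29}


def real_time_factor(total_ms, audio_seconds=AUDIO_SECONDS):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return (total_ms / 1000.0) / audio_seconds


for model, total_ms in TOTAL_MS.items():
    print(model, round(real_time_factor(total_ms), 2))
```

That works out to roughly 1x for tiny, 2x for base and 8x for small on this Pi 4.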

Yeah, I just don't think the Pi 4 has the oomph to run transformer models.

Intel have been doing optimisations (google “Fast DistilBERT on CPUs”), but currently a lot of models are out of range of a Pi 4, or, as with Whisper, the very small, heavily quantised models bear little of the accuracy the bigger models are famed for.
I have a 6 TOPS NPU embedded in the RK3588 that I think will run at least 3x faster than the CPU if I can get an int8 quantised version of Whisper.
When an embedded NPU shares address space and memory with the SoC it gets much faster than the Coral USB add-ons, or at least much faster than many of the Coral benchmarks I have seen.
For £70, the x4 that the Coral USB seems to give makes me wonder if it is worth it. It is really hard to compare, as how TOPS are accounted, quantisation and models all differ, but the small.en Whisper model would probably run well once int8 converted for an NPU of approx 4-6 TOPS.
On the 6 TOPS NPU of the RK3588 it produces approx 190 fps with ResNet-18 on 224x224 images, but what does that mean?
The power draw of the NPU is much lower though, at 1.5 W peak, so as well as being more performant it is also cooler and more efficient than the CPU, which draws about 5 W running the same model; that is the only real comparison I can make.
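One rough way to answer the "what does that mean?" question is to turn the 190 fps into delivered operations per second. This is only a back-of-envelope sketch: it assumes the commonly quoted figure of about 1.8 billion multiply-accumulates per 224x224 ResNet-18 inference and counts each multiply-accumulate as two operations, which may not be how the vendor counts TOPS.

```python
RESNET18_MACS = 1.8e9   # assumption: commonly quoted multiply-accumulates per image
OPS_PER_MAC = 2         # assumption: one multiply plus one add


def effective_tops(fps, macs_per_inference=RESNET18_MACS):
    """Tera-operations per second actually delivered at a given frame rate."""
    return fps * macs_per_inference * OPS_PER_MAC / 1e12


def utilisation(fps, rated_tops, macs_per_inference=RESNET18_MACS):
    """Fraction of the rated peak that the measured frame rate represents."""
    return effective_tops(fps, macs_per_inference) / rated_tops


print(effective_tops(190))
print(utilisation(190, rated_tops=6.0))
```

Under those assumptions 190 fps is well under 1 TOPS delivered, a small fraction of the rated 6 TOPS, which is one reason headline TOPS numbers are so hard to compare between devices.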

I am thinking Georgi Gerganov has already taken optimisation to the max, and currently it is either use a GPU or keep your fingers crossed for an int8 NPU version.

This looks interesting as it has been converted to TFLite, and it also looks like it has been separated into distinct functions.

It is TFLite and, by the looks of it, much of it is already converted to int8, so it would likely run on a Coral accelerator, or the model could even be partitioned between CPU and NPU.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Inference time 2 seconds

[_SOT_][_NOT_] And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

PS: has anyone tried the bigger Whisper models? For some reason it is not mentioned, but you can have crazy SNR levels and the medium or large model, compared to others, gets the transcription correct as if by magic.

I have been playing with DeepFilterNet, which is great but not much good for Whisper as it actually makes results worse.
If you have a model that you can train, then you would likely have to preprocess the dataset with DeepFilterNet, but it would then quite likely do just as well with noise. OpenAI have kept the training of Whisper to themselves, but you should try the larger models with noise as the results are outstanding.

Hi there,
I’m quite new to Mycroft.
Can I hope that we will see “Introducing Whisper” on the Mark II - Mycroft at some point?
I’m thinking of buying one but am still unsure, as I also have a strong NAS that could maybe handle “Whisper”.
Thank you

@StuartIanNaylor Reaching out to you here as the question is very much related to all of the above.

I am also looking into the INT8-converted Whisper model that runs on TFLite, as I expect that is more of an option for low-powered devices such as the RPi family. Thank you very much for the heads-up about it.

I see you investigated the noise removal option DeepFilterNet, and just like me a long time ago when I looked into RNNoise, it was not really helping.

Anyhow, the question: how/what did you do with the audio sample rate? The TensorFlow Lite models, Whisper, etc. are all trained on and only support 16 kHz audio, while all the noise reduction programs and models are trained on and only support 48 kHz audio. I guess resampling will not do the whole process any good, especially resampling from 48 kHz, with artefacts after the noise removal, down to 16 kHz.

Resampling down is effectively lossless here (you only drop content above 8 kHz, which a 16 kHz model never sees anyway); it is just upsampling where there are problems.
DeepFilterNet is young: he only started developing it about a year ago, with the first release on Nov 5, 2021. Artefacts don’t matter, but if you use such a filter (stay away from RNNoise as it is bad and very old; DTLN is OK as well) you need to create a dataset that has been cleaned by the filter, not the clean dataset that most ASR or KWS have used. Then the training learns the filter's fingerprint (its artefacts) as part of the model.

The MS DNS-Challenge has a quite good tool to add noise to a dataset

You would have to take all of LibriSpeech or something similar, run your filter on it and then do your training; this is true even for hardware if you truly want it optimal.
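As a rough idea of what the "add noise to a dataset" step involves, here is a minimal sketch of mixing a noise clip into a speech clip at a chosen SNR. It is not the DNS-Challenge tool itself, just the underlying arithmetic in plain Python; the function names and the synthetic test signals are made up for illustration. The noisy result would then go through the filter before training.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech RMS / noise RMS matches snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


# Synthetic stand-ins: one second of a 440 Hz tone as "speech", white noise as "noise".
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]
mixed, gain = mix_at_snr(speech, noise, snr_db=5.0)
print(len(mixed), gain)
```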

So yeah, downsampling is no problem and neither, really, are the artefacts; it is the models trained on clean voice that are the problem in this scenario.
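For the 48 kHz to 16 kHz step, the operation is a low-pass filter below the new 8 kHz Nyquist limit followed by keeping every third sample. Below is a dependency-free sketch of that; the tap count and 7 kHz cutoff are arbitrary illustrative choices, and in practice a library resampler (for example `scipy.signal.resample_poly`) would normally be used instead of a hand-rolled filter.

```python
import math


def lowpass_taps(num_taps, cutoff_hz, sample_rate):
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    fc = cutoff_hz / sample_rate  # cutoff as a fraction of the sample rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def downsample_48k_to_16k(samples, num_taps=127):
    """Low-pass below the new 8 kHz Nyquist limit, then keep every third sample."""
    taps = lowpass_taps(num_taps, cutoff_hz=7000, sample_rate=48000)
    half = num_taps // 2
    out = []
    for centre in range(0, len(samples), 3):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = centre + k - half
            if 0 <= idx < len(samples):
                acc += tap * samples[idx]
        out.append(acc)
    return out


def tone(freq_hz, seconds=0.1, sample_rate=48000):
    """Generate a unit-amplitude sine wave as a list of floats."""
    count = round(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(count)]


speech_band = downsample_48k_to_16k(tone(1000))    # inside the speech band
above_nyquist = downsample_48k_to_16k(tone(12000))  # cannot exist at 16 kHz
print(max(abs(s) for s in speech_band[100:-100]))
print(max(abs(s) for s in above_nyquist[100:-100]))
```

The 1 kHz tone comes through at essentially full level while the 12 kHz tone is filtered away rather than aliasing into the speech band.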

Whisper on the bigger models has a jaw-dropping ability to filter out noise by itself, and if we could train it with DeepFilterNet it would likely be even more awesome. Maybe something is essentially lost, but I do not think so; it is just a cutting-edge transformer model that would equally learn filtered voice, but that is absent from its dataset.

People are trying to hack the Whisper model as it was released as a binary, and I dunno if int8 will happen or not, but NPU/GPU/CPU are all becoming part of an ML processing unit on fairly low-cost devices.
I have been wondering if Google Coral are going to update their devices, but many good solutions are cheaper and faster by being onboard and sharing memory.

PS @j1nx: I have started playing with KWS again, as my RTX 3050 ain't really up for ASR training. It has been a while, and my MS plays havoc with my memory.
If you ever get the chance, check this out and tell me if sounddevice is doing the same for you. I am sure it wasn’t like this before, but I am having to set a software gain, after much head-scratching over why I wasn’t getting recognition.

https://drive.google.com/file/d/1m8-LvW9vpOG4iJVYUaOWRr-QKGuA1cl7/view?usp=share_link

That is just a TFLite model and the simple code is here, but am I doing something stupid? Once more, after a break, it is like starting with a blank slate.

import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading
  
def sd_callback(rec, frames, time, status):
    # Only these three globals are reassigned here; gain, sample_rate and
    # rec_duration are read-only module-level settings.
    global max_rec, kw_hit, kw_count

    # Notify if errors
    if status:
        print('Error:', status)

    # Reshape the (frames, channels) block to (1, N) and apply the software gain
    rec = np.reshape(rec, (1, int(sample_rate * rec_duration)))
    rec = np.multiply(rec, gain)

    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # set input states (index 1...)
    for s in range(1, len(input_details1)):
        interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])

    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # get output states and set it back to input states
    # which will be fed in the next inference cycle
    for s in range(1, len(input_details1)):
        # The function `get_tensor()` returns a copy of the tensor data.
        # Use `tensor()` in order to get a pointer to the tensor.
        inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])

    # Track the loudest sample seen since the last reset
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
        max_rec = lvl

    if output_data[0][0] > 0.95:
        # Keyword class is firing on this 20 ms block
        kw_hit = True
        kw_count += 1
        print("Marvin:", output_data[0][0], lvl)
    elif output_data[0][1] > 0.90:
        # Second class is confident: close off any keyword run in progress
        if kw_hit:
            print('Max lvl:', max_rec)
            kw_hit = False
            max_rec = 0.0
            if kw_count > 60:
                print('Hello Marvin', kw_count)
        kw_count = 0

        
# Parameters
rec_duration = 0.020
sample_rate = 16000
num_channels = 1

gain = 10.0
max_rec = 0.0
kw_hit = False
kw_count = 0
sd.default.latency= ('high', 'high')
sd.default.dtype= ('float32', 'float32')

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn/tflite_stream_state_external/stream_state_external.tflite")
#interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_2/tflite_stream_state_external/stream_state_external.tflite")

interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

inputs1 = []

for s in range(len(input_details1)):
  inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))
    

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()

Can you see how I have hacked it with gain = 10.0? Is it the same on your system or is it just me?
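A quick way to check whether the input really is arriving that quiet is to look at the peak level of a block in dBFS (with float32 samples, full scale is 1.0). This is a small sketch with made-up helper names and a made-up example block; it could be called on `rec` inside the callback before the gain is applied.

```python
import math


def peak_dbfs(block):
    """Peak level of a block of float samples relative to full scale (1.0)."""
    peak = max(abs(s) for s in block)
    return float("-inf") if peak == 0 else 20 * math.log10(peak)


def gain_for_target(block, target_dbfs=-3.0):
    """Linear gain that would bring the block's peak up (or down) to target_dbfs."""
    peak = max(abs(s) for s in block)
    if peak == 0:
        return 1.0
    return 10 ** (target_dbfs / 20) / peak


# Hypothetical quiet block peaking at 0.1 of full scale.
block = [0.0, 0.05, -0.1]
print(peak_dbfs(block), gain_for_target(block, target_dbfs=0.0))
```

If normal speech peaks around -20 dBFS, a linear gain of 10 is exactly what brings it to full scale, which would be consistent with the hack above.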

Going through your post a bit later, just a quickie:

Exactly. Most KWS systems are trained on 16 kHz and it is most likely useless to de-noise the continuous listening mode for KWS.

So run a continuous listening thread for KWS. Take precise-lite as an example, as we are on the Mycroft forums. For that we have to run our hardware at 16 kHz. As soon as we have KWS confirmation we use VAD to know when we have stopped speaking. For that we can use webrtc or Silero, which both do a proper job. The latter performs best, however it needs onnx-runtime, which I have not yet implemented.
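The "KWS hit, then VAD until we stop speaking" step could be sketched roughly as below. The `is_speech` argument is a plug-in point where a webrtc or Silero check would go; the energy threshold used here is only a crude stand-in so the sketch runs without extra packages, and all names and defaults are hypothetical.

```python
def energy_vad(frame, threshold=1e-3):
    """Crude stand-in for a real VAD: speech if mean squared amplitude is high enough."""
    return sum(s * s for s in frame) / len(frame) > threshold


def collect_utterance(frames, is_speech, max_silence_frames=25, max_frames=500):
    """Gather frames after a wake-word hit until enough trailing silence is seen."""
    utterance = []
    silence_run = 0
    heard_speech = False
    for frame in frames:
        utterance.append(frame)
        if is_speech(frame):
            heard_speech = True
            silence_run = 0
        elif heard_speech:
            silence_run += 1
            if silence_run >= max_silence_frames:
                break
        if len(utterance) >= max_frames:
            break
    if not heard_speech:
        return []
    # Drop the trailing silence that triggered the end of the utterance.
    return utterance[:len(utterance) - silence_run]


# Synthetic 20 ms frames at 16 kHz: a pause, a command, a long pause, more talking.
silent = [0.0] * 320
loud = [0.5] * 320
frames = [silent] * 3 + [loud] * 5 + [silent] * 10 + [loud] * 5
command = collect_utterance(frames, energy_vad, max_silence_frames=4)
print(len(command))
```

Only the collected frames would then need to go through any resampling and denoising before STT, rather than the whole continuous stream.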

Anyhow, for all VAD-detected audio we need to resample our mic up to 48 kHz to feed it into the denoise system and then straight back down to 16 kHz for the STT system. I have not yet run any tests as I am still creating the whole image to start playing with it, but my gut feeling says it is just way too many audio changes.

Just trying to find different software components to play with that are a bit more aligned to one hardware setting for the audio.

…getting back to you about the rest…

One of the biggest problems with KWS is noise. I dunno if you have tried DeepFilterNet, but its ability to filter out voice where the noise SNR is higher is actually really good.
DeepFilterNet could filter out third-party noise…

Silero looks quite interesting as IMO webrtc just sucks :slight_smile:
You can always use a loopback to resample, as all of that is a very light process.

I am more confused about Arm hardware though. When it comes to running ML, annoyingly, by far the most compatible option is likely Android: it has NNAPI, where things generally just work due to its mobile heritage. On Linux you are probably looking at vendor frameworks such as ArmNN and the like of RKNPU, if that is your flavour, which are a real pain as it is yet another framework to convert to and it is far from plain sailing.
There are things like microsoft/onnxruntime-tvm on GitHub (“Open deep learning compiler stack for cpu, gpu and specialized accelerators”) that might run on Mali and various other GPUs, but again it is another framework that is a lot of work just to appraise as good or bad before you start. I have searched several times for an elusive working AArch64 NNAPI, as it is a Linux fork after all!

I will have to have a look at Silero as…

We report results for the following types of models:

  • FP32 (baseline)
  • FP32 + Fused (CE v1)
  • FP32 + INT8
  • FP32 Fused + INT8
  • Full INT8 + Fused (EE, small)
  • Best / xsmall (EE, xsmall, quantized, compiled, further improved and optimized)
  • xxsmall - cutting edge model, used in EE distros

Might see how the RK3588 goes, purely for interest.

Maybe another one to pass on, as inference is quick but the quality with that model…

morning this tunesity is election day and months spirited deb and igous campaigning that time is come for americans to make important decisions about our nations’ future and courage all americans to go tothe polls and vote electctionseason brings out the spirit of competition between our political parties and that competition is an central part of althy democracy but as the campaigns come to a closes republicans democrats and independence can find common ground on least one point our system of represented democracy is one of americ’ great est strength the united states was found on the beli all men created equal every election day millions of americans of all races religions and background step and voting boo throughout the nion whether they are richer poor old or young each of them and an equal chare and choosing the path that our country will take and every b let they cast is reminder that our founding principles are alive and well voting is one the great privileges of americans citizenship it is always required brave de fenders and you head to the polls next week remember the sacrifices that been made by generations of americans and uniform to preserve our way of life from bucker hild to bag dead that men and women of americansan ar forces have been devoted guardians of our democracy all of so them and their families a special de of atitude on election day americans should also remember the important example that our election set throughout the world the young democracies from georg and ukraine toafghanistan and iraq and look to the united states for proof that self government can andendure and nations that still of aranian and ression can find ho and inspiration in our commitment to libery for more than two centuries americans of demonstrated the ability of free people to choose their own leaders our nation has flourished because of its commitment to trusting the wisdom of our citizen ry and this year’ election we will see this tradition continue and we will be 
reminded once again that we are bless to live of free nation guided by the will of the people thank you for listening
11.876678466796875 x31.78

wget --quiet --show-progress -O samples/gb0.ogg https://upload.wikimedia.org/wikipedia/commons/2/22/George_W._Bush%27s_weekly_radio_address_%28November_1%2C_2008%29.oga