Resampling down is lossless; it's upsampling where there are problems.
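For reference, a 48 kHz to 16 kHz downsample is just a low-pass at the new Nyquist followed by decimation. Here is a minimal numpy-only sketch (a windowed-sinc FIR; in real code you would just use something like `scipy.signal.resample_poly`, and the helper name here is my own):

```python
import numpy as np

def downsample_3x(x, num_taps=121):
    """Low-pass at the new Nyquist (fs/6), then keep every 3rd sample.
    Minimal windowed-sinc FIR, for illustration only."""
    cutoff = 1.0 / 3.0  # fraction of the original Nyquist
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.hamming(num_taps)
    h /= np.sum(h)  # unity gain at DC
    y = np.convolve(x, h, mode='same')
    return y[::3]

fs = 48000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)  # 1 s of 440 Hz at 48 kHz
y = downsample_3x(x)
print(len(y))  # 16000 samples, i.e. one second at 16 kHz
```

The 440 Hz tone is far below the 8 kHz cutoff, so it passes through essentially untouched, which is the "lossless" part: anything below the new Nyquist survives.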
DeepFilterNet only started development about a year ago, with the first release on Nov 5, 2021, and its artefacts don't really matter. But if you use such a filter (stay away from RNNoise, as it is poor and very old; DTLN is OK as well), you need to create a dataset that has been cleaned by the filter, not the clean dataset that most ASR or KWS models have used. That way the training learns the filter's fingerprint (artefacts) as part of the model.
The MS DNS-Challenge has quite a good tool for adding noise to a dataset.
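At its core such a tool is just mixing noise into clean speech at a target SNR. The DNS-Challenge repo has its own scripts; this is only a sketch of the arithmetic involved:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) == snr_db, then add."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
noise = rng.standard_normal(16000).astype(np.float32)
noisy = mix_at_snr(clean, noise, snr_db=5.0)  # speech 5 dB above the noise
```

A real augmentation pass would sweep `snr_db` over a range (say 0 to 20 dB) and over many noise clips so the model sees a spread of conditions.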
You would have to take all of Librispeech or something similar, run your filter on it, and then do your training; the same is true even with a hardware filter if you truly wanted it optimal.
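That "filter the whole training set" step is really just walking the corpus and running the same filter you will use at inference, writing the results out to a mirrored tree. A rough sketch, where `denoise` is a stand-in for whatever filter you pick (DeepFilterNet, DTLN, ...) and `.npy` clips stand in for wav files to keep it dependency-free:

```python
import numpy as np
from pathlib import Path

def denoise(audio, sample_rate):
    # Stand-in for the real filter (e.g. DeepFilterNet's enhancement call);
    # identity here just to keep the sketch runnable.
    return audio

def filter_corpus(src_dir, dst_dir, sample_rate=16000):
    """Run the inference-time filter over every clip, mirroring the
    source tree, so training sees the filter's fingerprint."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    count = 0
    for clip in src_dir.rglob("*.npy"):
        audio = np.load(clip)
        out = dst_dir / clip.relative_to(src_dir)
        out.parent.mkdir(parents=True, exist_ok=True)
        np.save(out, denoise(audio, sample_rate))
        count += 1
    return count
```

You then point your KWS/ASR training at `dst_dir` instead of the original clean corpus.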
So yeah, downsampling is no problem, and really neither are the artefacts; in this scenario it is the models trained on clean voice that are the problem.
Whisper, on the bigger models, has a jaw-dropping ability to filter out noise by itself, and if we could train it with DeepFilterNet it would likely be even more awesome. Or maybe something is essentially lost, but I do not think so; it is just a cutting-edge transformer model that would equally learn filtered voice, which is simply absent from its dataset.
People are trying to hack the Whisper model, as it was released as a binary, and I don't know if int8 will happen or not, but NPU/GPU/CPU are all becoming part of an ML processing unit on fairly low-cost devices.
I have been wondering if Google Coral are going to update their devices, but many good solutions are cheaper and faster being onboard and sharing memory.
PS @j1nx, I have started playing with KWS again, as my RTX 3050 isn't really up to ASR training; it's been a while, and my MS plays havoc with my memory.
If you ever get the chance, check this out and tell me if Sounddevice is doing the same for you. I'm sure it wasn't like this before, but after much head-scratching over why I wasn't getting recognition, I'm having to set a software gain.
https://drive.google.com/file/d/1m8-LvW9vpOG4iJVYUaOWRr-QKGuA1cl7/view?usp=share_link
That is just a tflite model, and the simple code is below, but am I doing something stupid? Once more, after a break, it's like starting with a blank slate.
```python
import tensorflow as tf
import sounddevice as sd
import numpy as np
import threading

def sd_callback(rec, frames, time, status):
    global gain, max_rec, kw_hit, kw_count, sample_rate, rec_duration
    # Notify if errors
    if status:
        print('Error:', status)
    rec = np.reshape(rec, (1, int(sample_rate * rec_duration)))
    rec = np.multiply(rec, gain)
    # Make prediction from model
    interpreter1.set_tensor(input_details1[0]['index'], rec)
    # Set input states (index 1...)
    for s in range(1, len(input_details1)):
        interpreter1.set_tensor(input_details1[s]['index'], inputs1[s])
    interpreter1.invoke()
    output_data = interpreter1.get_tensor(output_details1[0]['index'])
    # Get output states and set them back as input states,
    # which will be fed in on the next inference cycle
    for s in range(1, len(input_details1)):
        # get_tensor() returns a copy of the tensor data;
        # use tensor() in order to get a pointer to the tensor.
        inputs1[s] = interpreter1.get_tensor(output_details1[s]['index'])
    lvl = np.max(np.abs(rec))
    if lvl > max_rec:
        max_rec = lvl
    if output_data[0][0] > 0.95:
        kw_hit = True
        kw_count += 1
        print("Marvin:", output_data[0][0], lvl)
    elif output_data[0][1] > 0.90:
        if kw_hit:
            print('Max lvl:', max_rec)
            kw_hit = False
            max_rec = 0.0
            if kw_count > 60:
                print('Hello Marvin', kw_count)
            kw_count = 0

# Parameters
rec_duration = 0.020
sample_rate = 16000
num_channels = 1
gain = 10.0
max_rec = 0.0
kw_hit = False
kw_count = 0
sd.default.latency = ('high', 'high')
sd.default.dtype = ('float32', 'float32')

# Load the TFLite model and allocate tensors.
interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/crnn/tflite_stream_state_external/stream_state_external.tflite")
#interpreter1 = tf.lite.Interpreter(model_path="../GoogleKWS/models2/bc_resnet_2/tflite_stream_state_external/stream_state_external.tflite")
interpreter1.allocate_tensors()

# Get input and output tensors.
input_details1 = interpreter1.get_input_details()
output_details1 = interpreter1.get_output_details()

# Zero-initialise the streaming state tensors
inputs1 = []
for s in range(len(input_details1)):
    inputs1.append(np.zeros(input_details1[s]['shape'], dtype=np.float32))

# Start streaming from microphone
with sd.InputStream(channels=num_channels,
                    samplerate=sample_rate,
                    blocksize=int(sample_rate * rec_duration),
                    callback=sd_callback):
    threading.Event().wait()
```
Can you see how I have hacked it with gain = 10.0? Is it the same on your system, or is it just me?
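One alternative to the hard-coded gain = 10.0 would be estimating the gain from the capture itself, e.g. scaling each block towards a target RMS with a ceiling, so quiet float32 input still reaches the level the model expects. This helper is my own sketch, not anything from sounddevice, and the target/ceiling values are guesses:

```python
import numpy as np

def auto_gain(block, target_rms=0.1, max_gain=20.0):
    """Scale a float32 audio block towards target_rms, capped at max_gain."""
    rms = np.sqrt(np.mean(block ** 2))
    if rms < 1e-8:  # effectively silence, leave untouched
        return block, 1.0
    g = min(target_rms / rms, max_gain)
    return block * g, g

# A quiet block, roughly 20 dB below the target level
quiet = 0.01 * np.random.default_rng(1).standard_normal(320).astype(np.float32)
boosted, g = auto_gain(quiet)
```

In the callback you would replace `rec = np.multiply(rec, gain)` with `rec, _ = auto_gain(rec)`; smoothing `g` across blocks would stop it pumping on silence boundaries.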