With beamforming / EC the STT accuracy becomes abysmal

When I activate beamforming for my PS3 Eye via module-echo-cancel, the STT recognition rate becomes unacceptable. I have the correct mic geometry for the PS3 Eye set up, too.
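For reference, the relevant default.pa entry looks something like this (a sketch from memory, not my exact file; the geometry is the PS3 Eye's four mics at roughly 20 mm spacing, and the source/sink names are my own):

load-module module-echo-cancel aec_method=webrtc source_name=ec_source sink_name=ec_sink aec_args="beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0"
set-default-source ec_source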

When I record with and without beamforming and compare the recordings, I can hear that with beamforming the sound is a little muffled, but as a human I can pick out the words without a problem. I'm running a local instance of DeepSpeech and it works well enough without beamforming, as long as there's no background noise or music playing. I can imagine that the artifacts I hear with beamforming are throwing off the STT engine, but I haven't seen a single post from other users mentioning the same problem.

Could it be something that Google’s acoustic models handle gracefully but Mozilla’s do not?
Is there something else I need to do to make STT work with beamforming / echo-cancellation?

By “Mozilla” do you mean the DeepSpeech or Coqui STT models? If so, those are usually trained on Common Voice data, which is fairly noisy. It may simply be that cleaner audio performs worse because the model has learned to expect noise.

You could test this hypothesis by training a model from scratch on clean data, or by fine-tuning an existing model. LibriVox might be a good source of cleaner data than Common Voice.
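If you wanted to try the fine-tuning route, a rough sketch (assuming a DeepSpeech 0.9-style checkout with its matching release checkpoint; the CSV paths are placeholders for your clean dataset):

# hypothetical paths; n_hidden must match the release checkpoint
python3 DeepSpeech.py \
  --checkpoint_dir deepspeech-0.9.3-checkpoint \
  --train_files clean_train.csv \
  --dev_files clean_dev.csv \
  --test_files clean_test.csv \
  --n_hidden 2048 \
  --epochs 3 \
  --learning_rate 0.00001

The low learning rate is deliberate, so the model nudges toward the cleaner data without forgetting what it already knows.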


You will not be beamforming for long anyway, as it has been dropped upstream and will soon be gone.

The beamforming works sort of OK if you set it up right (by default it points left or right, I forget which), but it has no way to update the coordinates at runtime, as they are part of the static PulseAudio config.
The number of mics you have dictates how narrow a beam you can create, and there is a really good InvenSense datasheet that goes through all the basics of beamforming technology to a pretty good level, which I always quote: https://invensense.tdk.com/wp-content/uploads/2015/02/Microphone-Array-Beamforming.pdf

The webrtc modules have their basis in Chrome & Chromebooks, where the beamformer & AEC were focused on a static dual mic with someone in close proximity: the AEC stopped feedback, and a noise suppressor also reduced keyboard clicks.
It worked quite well for what it was intended for, but beyond that it is pretty useless and deprecated; the fact that it is being dropped might give a hint as to its worth.

I do have a delay-sum repo, GitHub - robin1001/beamforming, that I have been meaning to hack on and change the wavreader to a stdin stream.
My C is still non-existent, and a delay-sum isn't great, but it's the best compromise for load.
It needs to be C, as Python is a great language but for DSP it is poor: swapping small chunks between the interpreter and its back-end C libraries at audio rates is just about the worst thing you can do with Python.
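To show how light the per-sample work is, here is a minimal two-mic delay-and-sum sketch in C (my own illustration, not robin1001's code); the steering delay d would come from a TDOA estimate:

/* delay-and-sum: delay one channel by d samples, sum, normalise */
#include <stdint.h>
#include <stddef.h>

/* in: interleaved stereo S16_LE, frames: frame count,
   d: steering delay in samples applied to mic 1, out: mono result */
void delay_sum(const int16_t *in, size_t frames, size_t d, int16_t *out)
{
    for (size_t n = 0; n < frames; n++) {
        int32_t a = in[2 * n];                      /* mic 0, frame n  */
        int32_t b = (n >= d) ? in[2 * (n - d) + 1]  /* mic 1, delayed  */
                             : 0;                   /* warm-up gap     */
        out[n] = (int16_t)((a + b) / 2);            /* average to avoid clipping */
    }
}

That is essentially all the beamformer itself does; estimating d is where the load goes.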

The PS3 Eye is a great quad mic, but the great algorithms that used it were sadly inside the PS3, and we just don't have anything comparable and haven't for a long time; I have been banging on about this for what seems like forever.
I have no idea why it is recommended when we are known to be missing decent algorithms such as beamforming and the TDOA (Time Difference of Arrival) needed to direct its beam; if you have a KWS you can do some sexy things like steering directly onto the voice.

But your bad results might have nothing to do with the STT, because you are completely blind to what your STT actually received.
Try this in a simple venv: it gives a simple max-volume readout but also captures the KW, so you can play the capture back in Audacity and see what you are getting. It's a 'Hey Marvin' KWS, but it's a simple script that makes it easy to see what is going on.
I haven't looked at Mycroft for quite a while now, but maybe someone can advise on how to capture the incoming audio stream, as often it isn't what, or as good as, you think it is.

https://drive.google.com/file/d/1EFT4T0sxyVo9EXWMh-V0BL4QWXAFfVlE/view?usp=sharing

You can run that as a simple script, without the confusion of a whole system, or at least check a simple VU meter:

arecord -D plughw:1 -V mono -r 16000 -f S16_LE -c 1 /dev/null

Also, the PS3 Eye drivers do work, but I don't think it has an input volume control? I always used the cnxsoft guide.

AGC is usually essential on the input: with a room mic, sound attenuates greatly with distance, so you cannot get away with a static volume.
Also, the input volume is often just too low, so people set it to 100%, which is a really bad idea, as it leaves zero headroom and causes clipping, which sends out resonant frequencies galore, just like a distortion pedal (which does the same thing deliberately).
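So set a sensible static level first and let AGC ride the rest; something like the below (the control name varies by card, so check amixer -c 1 scontrols first):

amixer -c 1 sset Mic 70%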

If you haven't got hardware AGC, use Speex; I can share my /etc/asound.conf:

# pcm default to allow auto software plughw conversion
pcm.!default {
  type asym
  playback.pcm "play"
  capture.pcm "cap"
}

ctl.!default {
  type hw
  card 1
}
ctl.equal {
  type equal;
}
pcm.plugequal {
  type equal;
  slave.pcm "plughw:1,0";
}
pcm.equal {
  type plug;
  slave.pcm plugequal;
}

# pcm is plughw so auto software conversion can take place
# pcm hw: is direct and faster but likely will not support the sampling rate
pcm.play {
  type plug
  slave {
    pcm "plughw:1,0"
  }
}

# pcm is plughw so auto software conversion can take place
# pcm hw: is direct and faster but likely will not support the sampling rate
pcm.cap {
  type plug
  slave {
    pcm "plugequal"
    }
}

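# software AGC via the Speex plugin (needs libasound2-plugins built with Speex support)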
pcm.agc {
 type speex
 slave.pcm "cap"
 agc on
 agc_level 2000
 denoise off
}


# sudo apt-get install libasound2-plugins
# otherwise ALSA falls back to linear resampling: lower load but poorer quality
defaults.pcm.rate_converter "speexrate"

In the above I have hardware AGC (capture.pcm "cap"), but to enable software AGC just change that line to capture.pcm "agc".
PS: yeah, I also have the ALSA equalizer acting as a voice bandpass.
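A quick way to audition the software AGC, assuming the config above:

arecord -D plug:agc -V mono -r 16000 -f S16_LE -c 1 agc_test.wav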

For some reason Debian still ships the RC of Speex even though it is years old, while alsa-plugins rightly targets the release version; the build doesn't see it because Debian hasn't updated, so the Speex plugins don't get installed on Buster.
It might be fixed on Bullseye, but I haven't checked. It's a very easy, short compile and install, and if you struggle, ask and I will give you a quick howto.
This one was for Buster: GitHub - StuartIanNaylor/Alsa-plugins-speex-update, and make sure you run the right one for 32- or 64-bit.
aplay --version will show you which ALSA version to aim at, as it probably needs updating for Bullseye.

PS: don't use the denoise, as it is worse than RNNoise and is artefact city. The AGC is great though; the default agc_level of 8000 (which relates to max gain) is just crazy. I use 2000, but maybe it should be lower, depending on the variation in distance in use.

You can use the AEC and AGC of PulseAudio, but again the webrtc AEC is pretty poor: it does cancel, but it fails completely once the SNR gets to fairly modest levels. The Speex AEC attenuates and continues to do so, and you can use it with Pulse as well.
I know ALSA, and for single-use system audio I don't see the point of PulseAudio; it is good for its uses, but the default AEC & beamforming module is probably its worst.
I have been trying to get my head around PipeWire, as its AEC just received a fresh rewrite.
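If you do want to try the Speex AEC with Pulse, it is a one-liner in default.pa (a sketch; the source/sink names are arbitrary):

load-module module-echo-cancel aec_method=speex source_name=ec_source sink_name=ec_sink
set-default-source ec_source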


If you want a beamformer, I have one here that I am doing for my Project Ears thing.
It will have to be Google Drive, as my phone died and I am locked out of GitHub two-factor auth until my new one arrives.

https://drive.google.com/file/d/1K0TMHi9TpyIbmCydz0A6peBDs3MSU2ND/view?usp=sharing

The beamformer works fine; I haven't totally worked out the argparser I inherited yet, but it's ./ds input_device=1 output_device=1
Run it and press Enter to stop, and it will list the parameters and devices.

I will set up a GitHub repo when my phone turns up.
The speed of sound is 343 m/s, which at a 48 kHz sample rate means a single sample corresponds to about 7.15 mm of travel. My custom mic is a 95 mm stereo affair; dropping the sample rate or using smaller mic arrays gives you less resolution. That is what the margin parameter is for, as it just stops spurious errors; I guess it really should be set to 14, since 95 / 7.15 ≈ 13.3 samples is the maximum possible delay across the array.

It's not really what I am going to use in Project Ears, as there it will be embedded in the code; this is just an external one in C++, and my first bit of C++ hacking.

I just changed from file input to a streaming ALSA interface thanks to PortAudio, so thanks go to them.

If you were going to use it, set the output to a loopback, and then the other side will act as a mic source carrying the beamformed output of the multichannel mic.

Playing with ALSA loopback devices | Playing with Systems gives a decent guide, but basically whatever you play into the sink side becomes available as a source on the other.
So ramp the sample rate up to the max and use a plughw: device to auto-resample from the loopback, as then you will get the resolution needed.
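For example, something like this (assuming the snd-aloop loopback driver; the device indices are placeholders for whatever ./ds lists on your system):

# create the loopback card
sudo modprobe snd-aloop
# beamform from the PS3 Eye into the playback side of the loopback
./ds --input_device=<ps3eye index> --output_device=<loopback playback index> --sample_rate=48000
# the other side of the loopback now carries the beamformed stream
arecord -D plughw:CARD=Loopback,DEV=1 -r 16000 -f S16_LE -c 1 beamformed.wav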

Thanks for all the input. Guess there's no quick fix for my issue.

If you want a fixed beamformer then, while it is still in PulseAudio, it does work, but like I say the AEC is also pretty flaky above a certain threshold.
Or use what I just posted, but beamforming really only works well in conjunction with a method of choosing where to beamform to.
Otherwise it has a tendency to jump around to the loudest signal, as in the standalone version I posted. I don't know of anything quicker than ./ds --input_device=0 --output_device=0 --sample_rate=48000 --frames=4800 --margin=15; just comment out the printf calls with // so the CLI stays clean and run make, or run it as supplied.
That is probably why the PulseAudio one is a fixed beamformer: all the load I have found is in the TDOA (Time Difference of Arrival), while the beamformer itself is super light. Maybe they used a more complex method, but I never did get my head around the code.

But yeah, as said, beamforming isn't everything; it is just one small increment in an array of options for getting clean voice and a better SNR.

I just don't think you have the settings right in PulseAudio, as it does work; but you haven't posted your default.pa entries or a captured wav, so I can not say.

I remember the mic geometry, mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0, but I have forgotten the azimuth setting, target_direction=a,e,r; I think it did turn out to be degrees, but through the fog of time I don't remember the orientation.

At least with my homebrew beamformer you just have to specify the number of channels :slight_smile:
“0 radians azimuth is to the right of the array, and positive angles move in a counter-clockwise direction” is from the PulseAudio code, but I think it did turn out to be degrees, so try both.
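So a hedged pair to try in the aec_args (same PS3 Eye geometry; straight ahead of a broadside array would be 1.57 if azimuth is radians, 90 if degrees, and the radius value is a guess):

aec_args="beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0 target_direction=1.57,0,1"
aec_args="beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0 target_direction=90,0,1"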

Phone turned up: GitHub - StuartIanNaylor/ProjectEars: ProjectEars


To be clear: the beamforming seems to work. I'm using beamforming settings specific to the PS3 Eye. I can hear some artifacts in the result, but in general it does what it is supposed to do: background music is filtered out.

It's the STT part that is the problem. I'll try Google STT with the recordings I made to see if it performs better than DeepSpeech on this kind of processed audio.

Yeah, that is why I am saying this: I have had the beamforming working, and it didn't affect the STT, or at least the KWS (foggy memory, as it is over a year now).
It just seemed a bit pointless, as it was a fixed beam, and a broadside array with just an azimuth setting accepts as much from the rear as it does from the front.
It is only good for filtering noise off to one side; just because an omnidirectional mic is pointing forward doesn't mean it will reject sound from the rear, it will just be reduced by the length of travel, as sound behaves a bit like ripples in water.
The beamforming and AEC didn't seem to produce any prominent artefacts, but if you have noise_suppression=1, turn that off, as yeah, it will be artefact city, and unless your STT model was trained on a dataset including those artefacts it will reduce accuracy considerably.
If it was trained on a dataset with noise processed through the same system, then it would likely stay accurate.
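With the webrtc module that is just a flag in the aec_args alongside your beamforming settings, something like (a sketch):

aec_args="beamforming=1 mic_geometry=-0.03,0,0,-0.01,0,0,0.01,0,0,0.03,0,0 noise_suppression=0"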

Google STT will likely cope, as their models are more complex and the system is more clever, but that sort of defeats the whole purpose of privacy; even though Google's latest ASR can run offline, the Google STT here isn't.

Simple noise-suppression schemes are inherently full of artefacts, and if you have a model trained on clean data they will likely kill recognition. Plus, they are relatively useless anyway, as they are only good for static noise, while most noise, such as media, is dynamic rather than constant like a fan or air conditioning. The only noise suppression that works well with that is the likes of RTX Voice, or approaches that aren't really noise suppression at all: Google doesn't bother to suppress the unknown, they use targeted voice extraction and pull the voice out instead, which is why, from Android to smart speakers, you have to do a short enrollment and it saves 'voices' as profiles.

PS: I do also remember, and it seems to be only on Arm, that because you are using a USB mic rather than a device with input and output audio sharing the same clock, the drift compensation just isn't the same as on x86_64; it seemed to be woeful and would constantly be re-syncing, so that probably isn't helping.

But anyway, I will leave you to it, as you haven't shared your default.pa entries or a captured wav, so I can not say.