Features my ideal voice assistant would include:

I’m really happy to see that this project exists. My current setup of 7 Echo devices scattered around the house keeps the thought at the back of my mind that I’ve invited a huge company into my home to observe my every move (even if only by determining where I am in the house from sounds, or whether I’m actually home at all).
The flexibility of Mycroft is also a huge advantage, in my opinion. Just the fact that we can change the voice, with more options than we could wish for, is a big step in the right direction.
Now I’m dreaming of my future setup, and wondering whether it might be achievable with Mycroft. The first part will be relatively simple – in-ceiling speakers throughout the house and a multi-zone amplifier, using a mixer to include voice responses. The other parts are probably very possible, but depend on the developers deciding whether it’s worth their time.

  1. A way to have multiple devices like Pi Zeros (each with a good mic attached) distributed around the house, each able either to transmit the audio or to process speech to text locally, then pipe that input to a main server with a payload identifying which device it came from; the server would then act on those commands. So basically the Pi Zeros would be like Echo devices, but instead of contacting Amazon over the cloud they would command your Mycroft server on your LAN.
  2. A way to toggle a mode where everything you say is processed to see if it is a command, with no need for a wake word. This would ideally be used when you are home alone, so the load on the server would not be too high as it tries to determine what is actually addressed to it. I live by myself, and I find it annoying to always be saying “Computer, turn on the light. Computer, what’s the weather like? Computer, what time is it?”. The only other time I speak is to my cat, and it should be pretty easy for the system to decide that something like “Good boy!” is not a command. The fact that the system is processing everything would not be an issue if nothing ever leaves the LAN.
    Any thoughts? Am I only dreaming here or does any of this sound plausible in the (near) future?
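As a toy sketch of what idea 1 could look like on the wire, here is a minimal satellite-to-server handoff. The JSON field names, device IDs, and room mapping are all made up for illustration; this is not any actual Mycroft protocol:

```python
import json

def make_command(device_id: str, text: str) -> bytes:
    """What a satellite (Pi Zero + mic) might send to the central server
    over the LAN after local speech-to-text: the utterance plus a payload
    identifying which device it came from."""
    return json.dumps({"device": device_id, "utterance": text}).encode()

# the server knows which room each satellite lives in (illustrative names)
DEVICE_ROOMS = {"pizero-kitchen": "kitchen", "pizero-bedroom": "bedroom"}

def handle(raw: bytes) -> str:
    """Server side: resolve the sending device to a room, so an ambiguous
    command like "turn on the light" can act on the right room."""
    msg = json.loads(raw)
    room = DEVICE_ROOMS.get(msg["device"], "unknown")
    return f'{room}: {msg["utterance"]}'

print(handle(make_command("pizero-kitchen", "turn on the light")))
# → kitchen: turn on the light
```

The key point is only that the device identity travels with every utterance, so the server, not the satellite, decides what the command means in context.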

Client/server just makes sense, and I have been testing various bits and pieces for a couple of months; I still need to populate my repo: GitHub - StuartIanNaylor/ProjectEars: ProjectEars
I have been adamant for a long time that distributed wireless mic arrays should be, like any HMI (human-machine interface), system-agnostic and interoperable with all.
I struggled with the Pi Zero, but with the Pi Zero 2 there is so much more you can do.
ProjectEars (because you have to name a repo something) is a client/server KWS (keyword spotting) system that uses Snapcast to deliver audio to a wireless speaker system for output.
You get the idea: the system is split into zones, with members assigned either to a single channel or to all of a zone.
Snapcast is wonderful for that: it is ultra-light, already has a good community, and is a great piece of open source.
The KWS mic system mirrors that design, but uses WebSockets to transmit an Opus stream to the KWS server, with the same style of zone settings for multiple mics.
The compressed audio binary is framed by a text comms protocol, where the last keyword sensitivity reported selects which zone-member mic is used for that ASR sentence.
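That framing could be sketched like this; the field names and layout are my guess at the style, not ProjectEars' actual wire format:

```python
import json
import struct

def frame(device_id: str, zone: str, kw_hit: float, opus_payload: bytes) -> bytes:
    """Prefix the compressed audio with a length-delimited JSON text header
    carrying the mic identity and keyword hit value (illustrative fields)."""
    header = json.dumps({"device": device_id, "zone": zone, "kw": kw_hit}).encode()
    return struct.pack(">I", len(header)) + header + opus_payload

def unframe(message: bytes):
    """Split a framed message back into its header dict and audio bytes."""
    (hlen,) = struct.unpack(">I", message[:4])
    return json.loads(message[4:4 + hlen]), message[4 + hlen:]

header, audio = unframe(frame("mic-kitchen", "ground-floor", 0.93, b"\x01\x02"))
print(header["device"], len(audio))  # → mic-kitchen 2
```

The server can then compare the `kw` values across a zone's members and keep only the stream from the mic with the best keyword hit for that sentence.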

The only restriction on interoperability and being system-agnostic is that a zone must use the same microphone hardware throughout, as the hardware has a huge effect on the KWS hit value passed back.

You can get some very cheap audio amps, e.g. the “DC 12V 24V Single Channel TPA3116 Digital Audio Amplifier Board 100W BTL Out High Power Amplifier Board” from AliExpress (it says 100 W, but for me probably 30 W into 4 Ω at 24 V DC). There is a huge range available, and an eBay pair of second-hand bookshelf speakers with your Zero 2 and amp attached to the back can also work really well.
Having separate speakers from the mics is a huge advantage, as the engineering that goes into smart assistants to acoustically isolate the mics is an enormous task.
If you ever have a look at a Google Nest Audio, it is built like RoboCop purely to limit audio resonance.
Google Nest Audio Teardown - iFixit

I have a range of software beamformers, which is what I have been playing with to keep costs down; the lowest-cost options are single- or two-mic versions using cheap USB sound cards or the 2-mic HAT we all know.
A 2-channel ADC is a rarity, and the best low-cost one seems to have reached EOL: the Enermax AP001E DreamBass USB soundcard, which for $10 had a 2-channel 96 kHz ADC that you could straight-wire 2 electrets to and mount in a 10 mm grommet.
So my search for stereo USB ADC cards continues; I think the ADA-17 USB - HQ MINI audio | Axagon claims to be stereo, which would let me do a simple endfire beamformer, whereas the 2-mic HAT likely has far too much spacing.
Anyway, I am working on a C++ beamforming utility that can accommodate up to 6 mics in software, but for reference this is a hugely useful source of info on mic arrays.

In a distributed wireless array you can simply use a single mic, but distributed arrays work even better with beamforming: a broadside array only attenuates the sides, while an endfire array attenuates the sides and the rear.
You can also get more complex hybrid and 3D arrays, where several mics give room coverage so that at least one should always be able to provide a capable signal with low noise.

As for a toggle mode with constant listening: the KW is very important, as it selects the best mic, locks the beamformers/filters onto the voice, and stops false positives; it is usually a unique word precisely to emphasise that a command follows.
What Google and Amazon do is allow an extended period after a recognised KW and intent sentence during which subsequent sentences can be issued without the KW; after a timeout it resets to requiring the KW again.
That period is relatively short, though, so a particular mic source doesn't get stuck as the default, and the reset lets whatever is the next-best source take over.
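That follow-up window is easy to sketch. A minimal version might look like this (the 8-second timeout is an assumption for illustration, not what Google or Amazon actually use):

```python
class FollowUpWindow:
    """After a recognised wake word, accept commands without the wake word
    for a short period; after the timeout, reset so the next-best mic
    source can take over and the wake word is required again."""

    def __init__(self, timeout_s: float = 8.0):
        self.timeout_s = timeout_s
        self._opened_at = None  # timestamp of the last wake-word hit

    def wake_word_heard(self, now: float) -> None:
        self._opened_at = now

    def accepts_command(self, now: float) -> bool:
        if self._opened_at is None:
            return False  # no wake word yet: KW required
        if now - self._opened_at > self.timeout_s:
            self._opened_at = None  # window expired: reset to KW-required
            return False
        return True
```

In a real system `now` would come from a monotonic clock, and each successfully handled intent would typically re-open the window.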

A voice assistant system is inherently client/server: it is hugely beneficial to share one more powerful central server rather than duplicating peer-to-peer hardware on every device, and by nature commands are very infrequent, so dedicated hardware mostly sits idle.
Because use is spread out like this, a given core count can likely service far more users, as clashes are uncommon and the queue time for a single voice sentence is short.

There is no reason for any distributed array system to be proprietary. I have no interest in ASR/NLU/skill servers, hence ProjectEars is just a client/server distributed KWS that can sit in front of any of them; if anyone is interested in joining in then please do, as I am really against proprietary systems that exist purely for branding and ownership.
Hopefully I will be able to operate against a Mycroft central server, as I can with others, so my mic-array sensors will work with all of them, just as you don't buy a Mycroft-specific keyboard and mouse.
Most off-the-shelf mic arrays strangely have pretty lousy mic spacing and geometry, where aliasing is likely to be very prominent; check the InvenSense app note mentioned above. Why mic arrays don't have snap-off sections with DuPont connectors, instead of dictating a single (often wrong) mic geometry, is also a strange omission.

PS: the Wondom boards, despite the unfortunate name, are pretty good, even if a bit more expensive.

PS: Google have beaten us again, though. On the Google Nest devices with the large iPad-like screen, the cam is rolling all the time, so you don't have to say the KW at all; it works out that you are talking to it, and you can issue commands directly.


Nice! Thanks for the great, in-depth response. Looks like I have some research to do. I’ll check out those links when I have the chance and consider my next move carefully. I was pretty keen on the centralized multi-zone amp so that I could automate it with Home Assistant and sync audio in different zones, so this gives me a lot to think about. Good stuff!!

It is an amazing resource, and even though delay-sum / delay-invert-sum are not extremely good methods, they are OK and very easy to run within the load budget of a Pi Zero 2, which at $15 would likely be my go-to for the cheapest wireless mic/speaker baseboard.
In fact most of the load is TDOA (time difference of arrival) estimation, but I am also trying to find better algorithms that will run on the Zero 2, which is just a slightly lower-clocked Pi 3.
If you do read the above, then for me the best option would probably be a 2-mic endfire that simply inverts and sums with a delay matching the mic spacing, creating attenuation at the rear and some at the sides.
On the cheap ReSpeaker-like HATs the spacing looks far too wide for that and will create aliasing, which is a shame, as paired with the Pi Zero 2 that would be an extremely cheap combo.
I think the MEMS mics on those could well be analogue, but hacking SMD electronics with my old eyes is something I try to dodge.
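For what it's worth, the invert-and-sum idea can be sketched in a few lines. This is a toy illustration, not production DSP: the mic spacing is chosen so the acoustic travel time between the two mics is exactly one sample at 16 kHz, avoiding fractional-delay filtering:

```python
import math

SAMPLE_RATE = 16000
SPEED_OF_SOUND = 343.0  # m/s
DELAY_SAMPLES = 1
# spacing that makes the inter-mic travel time exactly one sample (~21.4 mm)
SPACING_M = SPEED_OF_SOUND * DELAY_SAMPLES / SAMPLE_RATE

def delay_by(samples: list, n: int) -> list:
    """Delay a signal by n samples (zero-padded at the start)."""
    return [0.0] * n + samples[:-n]

def endfire_diff(front: list, rear: list, delay: int = DELAY_SAMPLES) -> list:
    """Delay-and-invert-sum: delay the rear mic by the acoustic travel time
    between the mics, invert it, and sum. A source directly behind the
    array cancels; a source in front passes through."""
    delayed_rear = delay_by(rear, delay)
    return [f - d for f, d in zip(front, delayed_rear)]

# simulate a 440 Hz tone arriving from the rear: it reaches the rear mic first
s = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1024)]
rear = s
front = delay_by(s, DELAY_SAMPLES)

out = endfire_diff(front, rear)
print(max(abs(x) for x in out))  # → 0.0: the rear source is perfectly nulled
```

Swapping the arrival order (source in front) leaves a non-zero output, which is the directional pattern being described: a null at the rear and pass-through toward the front.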

It is really beneficial to have a directional mic that covers a certain area and attenuates the rear and sides, as an omnidirectional mic (all directions) can easily be flooded with noise; though you could get away with a single omni if it's close enough.
Linux ALSA has an audio loopback module (snd-aloop): whatever you play into one end of the loopback can be captured from the other end as if it were an input.
So if you can install it on any ASR host, the processed stream will just look like a normal mic input.
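A minimal sketch of that loopback setup (card numbering can vary per system, and `processed_voice.wav` is just a stand-in for whatever the beamformer would emit):

```shell
# Load the ALSA loopback module (creates a virtual card called "Loopback")
sudo modprobe snd-aloop

# Anything played into one end of the loopback...
aplay -D hw:Loopback,0,0 processed_voice.wav &

# ...can be captured from the other end, as if it were a normal mic,
# so an ASR engine can simply open hw:Loopback,1,0 as its input device
arecord -D hw:Loopback,1,0 -f S16_LE -r 16000 -d 5 capture.wav
```

In practice the ASR software would be pointed at `hw:Loopback,1,0` (or an equivalent `.asoundrc` alias) instead of running `arecord` by hand.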
The only problem is the Pi Zero 2 seems to be out of stock everywhere, as Raspberry Pi do their usual single big production run when the time is right for them; next time I might stock up with an extra one or two :slight_smile:

You’re obviously a lot more technologically inclined than I am. I think if I were to commit to a project like this, I would be willing to spend a little extra in order to get the best possible results. One reason I want to graduate to this different hardware is because I’m fed up with the poor results I get from the Echos and always having to repeat myself.
I haven’t had a chance to read up on everything you’ve mentioned yet, but from what I understand, the speakers being attached to the listening device means it can more easily filter out music, etc., from what the mic is listening for? Would it be possible then to just tap the speaker wires from a centralized amp, reduce that signal to line level, and feed it into the device? That way you’d have the best of both worlds.

I have to be honest: when it comes to recognition, open source is not at the level of Amazon and Google, which are far ahead.
Google rule the roost with the Nest Audio; I got 2 Echo Gen 4s because of the 3.5 mm jack (so I could also use them as active speakers) and am still shocked how bad the recognition is compared to Google.
The Gen 3 Echo is supposedly better, which is strange, but the Gen 4 isn't good.

If you want more privacy, or you're just a tech geek who likes to play, then here is the place to be; but if it's accuracy you're after, this really doesn't compete with big data.
I do quite a lot of testing, and that is just being honest.

I went with Echos because of the 3.5mm jack as well. What we really need is perfect speech recognition and GPT-3 integration! :slight_smile:


Yeah, the 3.5 mm seemed like a good idea; they are also my computer speakers, and I use them when playing with stuff on here.
I can honestly say the Google Nest Audio is so much better for 'barge-in' and recognition, and they seem to have everything covered: their new offline ASR model on the Pixel phones is absolutely tiny and runs in less than 5 watts. Maybe it doesn't beat GPT-3, but it is up there with the current state of the art, and what they have done with their new models in a mobile footprint is just incredible.
Google also do something a bit different from Amazon with 'voice match', which takes a short spectral fingerprint of your voice and adds much accuracy; they have improved it further on their phones, but you have to buy a Pixel to get access…

Give things here a try: get a Pi Zero 2 or a Pi 4 2GB and a 2-mic HAT, as you will find it interesting to see the innards; but I doubt you will find an improvement, maybe just a new-found respect that your Echo ain't that bad :slight_smile:

I have a Google Pixel 5 and I haven’t been impressed with its voice recognition. Maybe the newer hardware is better.
Aside from the lack of a 3.5mm jack, the other reason I decided not to go with Google Home was the stupid wake words. “OK Google” isn’t a tongue-twister, but it doesn’t exactly roll off the tongue either, plus I hate advertising, so advertising to myself all day is not going to fly.

Nope, it's the new Pixel 6 that has the all-singing, all-dancing ASR running on their Tensor TPU.
Yeah, it's a pain being stuck with such limited wake words, but I guess they have built up a huge dataset on those, and they also like to advertise :slight_smile: