My theory is that humans have an amazing intuitive grasp of acoustics: we generate and shape sounds using all sorts of natural effects like reverberation and resonance, which add depth and warmth to music and the spoken word. That intuition falls apart, however, the moment we need to convert analog to digital and understand how audio works in the digital domain. It becomes a very unintuitive process.
Case in point: the number of microphones in far field devices. If we had more ears, would humans hear better? We apply magic to the “D” of DSP and assume it should be capable of pulling off amazing feats. (In reality, only testing the algorithm against a specific STT engine, with specific microphones, and measuring word error rate can tell you whether the combination is good, and really nothing else.)
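To make the "measure word error rate" point concrete, here is a minimal sketch of how WER is typically computed: a word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the STT hypothesis, divided by the reference length. The function name and example phrases are my own illustration, not from any particular STT toolkit.

```python
# Sketch: word error rate (WER) between a reference transcript and
# an STT hypothesis, via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions out of five reference words -> WER of 0.4
print(wer("turn on the kitchen lights", "turn on a kitchen light"))
```

Running the same utterances through the whole chain (mics, DSP, STT) and comparing WER is the only honest way to judge whether a given microphone/algorithm combination actually helps.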
Another misconception concerns microphone separation distances. We assume a larger array means better far field performance (maybe this is a cultural bias formed by the SETI program and the use of radio telescopes located thousands of kilometers apart), but in fact there is an optimum of distance, orientation, and geometry that is unique to each DSP algorithm. Sometimes this optimum is known and programmed into the DSP code, and sometimes it must be determined empirically.
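One reason "bigger is better" fails is spatial aliasing: for a uniform linear array, spacing the microphones wider lowers the frequency above which the array can no longer resolve direction unambiguously (roughly f = c / 2d, the spatial analog of Nyquist). This back-of-the-envelope rule is generic, not tied to any particular DSP algorithm, which is exactly why each algorithm still has its own geometric optimum. A quick sketch:

```python
# Sketch: the classic spacing trade-off for a uniform linear mic
# array. Spatial aliasing appears above f = c / (2 * d), so wider
# spacing (better low-frequency resolution) aliases sooner.
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 C

def max_unaliased_frequency(spacing_m: float) -> float:
    """Highest frequency (Hz) before spatial aliasing for spacing d."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)

for d in (0.02, 0.04, 0.08):  # 2, 4, 8 cm spacings
    f = max_unaliased_frequency(d)
    print(f"{d * 100:.0f} cm spacing -> aliasing above {f:.0f} Hz")
```

An 8 cm spacing already aliases above roughly 2 kHz, squarely inside the speech band, so simply spreading the mics apart can make a beamformer worse, not better.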
The combination of the large number of variables in audio processing and the physics of sound means that there are many little things that can go wrong when building a far field voice device.