I have had some mixed experiences with Alexa calling. It seems the placement of my Echo isn’t optimal for voice calls with other people. To those on the other end, I sound faint and am difficult to hear.
Far-field digital signal processing (DSP) algorithms for AI assistants are not the same as those used for speakerphone telephony. Yes, both can include beamforming and duplexing, but the sample rate, the frequencies that are filtered, and the voice activity detection can vary significantly.
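To make the contrast concrete, here is a minimal sketch of two hypothetical DSP profiles. The specific numbers (sample rates, pass-bands, VAD hangover) are illustrative assumptions in the spirit of classic narrowband telephony versus a wideband ASR front end, not any vendor’s actual settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DspProfile:
    name: str
    sample_rate_hz: int       # capture/processing rate
    band_hz: tuple            # pass-band kept by filtering (low, high)
    vad_hangover_ms: int      # how long VAD holds "speech" after energy drops
    full_duplex: bool         # echo cancellation tuned for two-way audio

# Narrowband telephony: optimized for a human listener on the far end.
TELEPHONY = DspProfile("telephony", 8000, (300, 3400), 200, True)

# Far-field ASR: wider band preserved for the speech-to-text engine.
FAR_FIELD_ASR = DspProfile("far_field_asr", 16000, (100, 7000), 500, False)

def describe(p: DspProfile) -> str:
    low, high = p.band_hz
    return (f"{p.name}: {p.sample_rate_hz} Hz sampling, "
            f"{low}-{high} Hz pass-band, VAD hangover {p.vad_hangover_ms} ms")

print(describe(TELEPHONY))
print(describe(FAR_FIELD_ASR))
```

Even in this toy form, the point stands: the two pipelines disagree on almost every parameter, so a path tuned for one use case is a compromise for the other.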
Conference call technology is designed to make parties audible to each other. The far-field microphones on the Amazon Echo or Google Home are designed for speech-to-text. The difference might seem subtle, but it can have significant ramifications.
Several years ago, when we were designing our own DSP algorithms for the Ubi, we managed to improve the sound quality of the audio recordings captured by the device. However, this did not lead to better speech recognition performance. In fact, by removing artifacts that the speech-to-text service may have been relying on, it degraded performance.
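A toy frame-based noise gate illustrates the trade-off we ran into: muting low-energy frames makes playback sound cleaner to a listener, but it can also erase quiet consonants and speech onsets that a speech-to-text model depends on. The frame size and threshold here are arbitrary assumptions for demonstration only.

```python
import math

def frame_rms(frame):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def noise_gate(samples, frame_len=160, threshold=0.02):
    """Zero out any frame whose RMS energy falls below `threshold`."""
    out = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame_rms(frame) < threshold:
            out.extend([0.0] * len(frame))   # "cleaner" for a human listener...
        else:
            out.extend(frame)                # ...but quiet speech is gone too
    return out

# A faint speech onset (amplitude 0.01) followed by louder voicing (0.1):
signal = [0.01] * 160 + [0.1] * 160
gated = noise_gate(signal)
print(gated[:160] == [0.0] * 160)   # the quiet onset was erased
print(gated[160:] == [0.1] * 160)   # the loud part survives
```

A person listening to the gated audio hears less hiss; an ASR engine looking for that faint onset has lost it entirely.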
For calling through far-field voice-first devices to be effective, they will need to change their DSP algorithms on the fly to suit teleconferencing. Since the device still requires far-field wake word detection and speech-to-text function during a call, this could lead to poor performance on the voice assistant side. It’ll be exciting to see how Amazon and Google develop in this challenging area.
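One way to picture that on-the-fly switch is a device that runs a telephony-tuned path for the far end while keeping an ASR-tuned tap alive so a "hang up" command still works. This is a sketch of that idea only; the class, method names, and mode strings are invented for illustration, and the DSP functions are placeholders.

```python
class VoiceFirstDevice:
    """Toy model of a device that swaps DSP profiles when a call starts."""

    def __init__(self):
        self.mode = "assistant"       # default: far-field ASR profile only
        self.asr_tap_active = True    # wake word/ASR is always listening

    def start_call(self):
        # Route outbound audio through telephony-style processing,
        # but keep the ASR tap so the assistant can end the call.
        self.mode = "call"

    def end_call(self):
        self.mode = "assistant"

    def process(self, frame):
        if self.mode == "call":
            # Two parallel outputs: one for the human, one for the ASR.
            return {"to_far_end": self.telephony_dsp(frame),
                    "to_asr": self.far_field_dsp(frame)}
        return {"to_asr": self.far_field_dsp(frame)}

    def telephony_dsp(self, frame):
        # Placeholder: band-limiting and full-duplex echo cancellation.
        return frame

    def far_field_dsp(self, frame):
        # Placeholder: beamforming and ASR-friendly filtering.
        return frame

device = VoiceFirstDevice()
device.start_call()
out = device.process([0.0] * 160)
print(sorted(out))   # ['to_asr', 'to_far_end']
```

The catch the paragraph above points at lives in those two placeholder methods: running both paths on the same microphones means each one gets a signal shaped by compromises made for the other.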