Listening back through my Alexa commands over the past few days, I keep coming back to one where I asked the Echo Show to set a timer for four minutes and it came back with twelve instead. When you're cooking, the difference between four and twelve minutes is the difference between well done and inanimate carbon rod.
I’m wondering if there’s some room for vision in this scenario. During the playback, I can hear my daughter speaking over me. Sure, one could apply beamforming, but this is relatively difficult, and I’ve found the Echo Show (at least the first generation) not to perform as well as other Alexa devices. Its microphone array is laid out differently and faces the back of the device.
However, the Echo Show has a camera. What if it were to apply edge-based computer vision to at least home the beam in on where the sound is coming from? Even better, what if it could also add lip reading to the mix to enhance the speech recognition?
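To make the camera-guided idea concrete, here's a minimal sketch of how a detected face position could steer a delay-and-sum beamformer toward the speaker. Everything here is hypothetical: the function names, the mic spacing, the field of view, and the assumption that the camera and mic array share a forward axis are all mine, not anything Amazon has documented about the Echo Show.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air
SAMPLE_RATE = 16000     # Hz, a common rate for far-field speech capture

def face_to_azimuth(face_x_norm, horizontal_fov_deg=70.0):
    """Map a face's normalized horizontal position in the camera frame
    (0.0 = left edge, 1.0 = right edge) to an azimuth angle in radians.
    Assumes the camera and mic array point the same way (hypothetical)."""
    return math.radians((face_x_norm - 0.5) * horizontal_fov_deg)

def steering_delays(azimuth_rad, mic_spacing=0.035, num_mics=4):
    """Per-microphone delays (in samples) to steer a uniform linear
    array toward the given azimuth. Spacing and count are made up."""
    delays = []
    for m in range(num_mics):
        # Extra acoustic path length for mic m relative to mic 0.
        extra = m * mic_spacing * math.sin(azimuth_rad)
        delays.append(extra / SPEED_OF_SOUND * SAMPLE_RATE)
    return delays

def delay_and_sum(channels, delays):
    """Shift each channel by its (rounded) delay and average them,
    reinforcing sound arriving from the steered direction."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        shift = int(round(d))
        for i in range(n):
            j = i - shift
            if 0 <= j < n:
                out[i] += ch[j]
    return [s / len(channels) for s in out]

# A face dead-center in the frame means azimuth 0, so no delays:
azimuth = face_to_azimuth(0.5)
print(steering_delays(azimuth))  # all delays are 0.0
```

The point of the sketch is that the vision side only needs to supply one number, an azimuth estimate, which is a far easier problem than full audio-only source localization on a rear-facing mic array.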
We’ve seen some movement over the past year in speech recognition via lip reading. Isn’t it time it made it into a device?