I talk about voice technologies a lot. Especially over the phone, and especially at the office. The issue is that I have many voice devices, and they often trigger even when I’m not referring to them.
We’ve seen a few new generations of voice trigger software evolve to deal with devices going off when the user didn’t intend them to. To quickly review, there are two big issues that trigger / wake word software can face:
- False Rejection — when the device doesn’t activate despite being called. This can be very frustrating.
- False Acceptance — when the device triggers by accident, without the user intending it. This can be cute if it happens occasionally but annoying if it happens very often.
The issue I’m referring to is #2.
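These two failure modes are usually measured as rates, which is how you'd compare one trigger model against another. Here's a quick sketch with made-up counts: false rejection rate (FRR) is misses over actual wake-word utterances, while false acceptance rate (FAR) is typically reported per unit of non-wake-word audio, often per hour.

```python
# Made-up counts for illustration only.
actual_wake_words = 200      # times the wake word was really spoken
missed = 6                   # device failed to respond (false rejections)
hours_of_other_audio = 50    # background speech, TV, music, etc.
accidental_triggers = 3      # device woke up uninvited (false acceptances)

frr = missed / actual_wake_words                  # fraction of calls missed
far = accidental_triggers / hours_of_other_audio  # accidental wakes per hour

print(f"FRR: {frr:.1%}, FAR: {far:.2f} per hour")
```

Lowering the trigger threshold trades one rate for the other: a touchier trigger misses fewer real calls but wakes up by accident more often.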
The first implementations of triggers used low-power models running on DSPs. The next evolution used larger trigger models running on application processors.
Then came distributed decision making, with both the target and host triggers needing to fire before the device would wake, to prevent false acceptance.
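The common pattern behind both the cascaded and distributed designs is that a cheap, always-on first stage gates a larger, more accurate second stage, and the device only wakes when both agree. Here's a minimal sketch of that idea; the score functions and thresholds are stand-ins I've made up, not any vendor's actual models.

```python
def first_stage_score(audio: str) -> float:
    # Stand-in for the low-power DSP (or target-side) trigger model.
    return 0.9 if "alexa" in audio else 0.2

def second_stage_score(audio: str) -> float:
    # Stand-in for the larger application-processor (or host-side) model.
    return 0.95 if "alexa" in audio else 0.1

def should_wake(audio: str, t1: float = 0.5, t2: float = 0.8) -> bool:
    """Wake only if the cheap stage fires AND the bigger stage confirms."""
    if first_stage_score(audio) < t1:
        return False  # cheap stage rejects; the bigger model never runs
    return second_stage_score(audio) >= t2

print(should_wake("play some jazz"))        # False
print(should_wake("alexa play some jazz"))  # True
```

The point of the cascade is power: the expensive model only spins up for the small fraction of audio the cheap model couldn't rule out.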
Nowadays, while triggers reside locally to prevent constant eavesdropping by the device, Amazon and Google will send the buffered audio of the trigger to the cloud to verify the trigger was actually spoken. If it’s determined that it was an accident, they shut down the stream. This implementation also prevents multiple devices within earshot of the speaker from responding at once.
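Conceptually, that's two server-side checks: re-score the buffered trigger audio with a bigger model, and when several devices heard the same utterance, let only the one with the strongest signal answer. A toy sketch of both, with hypothetical scores and device names of my own invention:

```python
def cloud_verify(trigger_score: float, threshold: float = 0.9) -> bool:
    """Re-check the trigger with a larger server-side model; the stream
    is torn down if it looks like an accident."""
    return trigger_score >= threshold

def arbitrate(devices: dict) -> str:
    """Pick the one device that should respond; devices maps each
    device name to how strongly it heard the wake word."""
    return max(devices, key=devices.get)

# Hypothetical scores from three devices in earshot of the same speaker.
heard = {"kitchen": 0.72, "living_room": 0.93, "bedroom": 0.55}
winner = arbitrate(heard)
if cloud_verify(heard[winner]):
    print(f"{winner} responds; the others stand down")
```

Real arbitration presumably uses richer signals than a single score, but the shape is the same: one winner per utterance, everyone else stays silent.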
The next generation of trigger may understand the context of the user. Maybe there’ll be an acknowledgement, like the device perking up its ears instead of responding? Knowing how many people are talking around the device and what type of speech is being spoken could be helpful. For example, if I’m referring to Alexa mid-sentence, there’s no pause before the wake word, but if I’m actually addressing Alexa, there’s usually some silence before it.
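That pause heuristic could be sketched as a simple check on the audio just before the trigger: if enough of the preceding frames were quiet, the speaker probably paused to address the device; if speech ran right up to the wake word, it was probably mid-sentence. The frame energies and thresholds below are made-up numbers, purely for illustration.

```python
def addressed_directly(pre_trigger_energies, silence_level=0.1,
                       min_silent_frames=5):
    """Return True if enough of the frames just before the wake word
    were quiet, i.e. the speaker paused before saying it."""
    silent = sum(1 for e in pre_trigger_energies if e < silence_level)
    return silent >= min_silent_frames

# Talking right up to the wake word ("...I asked Alexa about it...").
mid_sentence = [0.6, 0.7, 0.5, 0.6, 0.8, 0.7, 0.6, 0.5]
# A pause, then "Alexa, ..." as a direct address.
direct_call = [0.05, 0.02, 0.03, 0.04, 0.02, 0.03, 0.6, 0.7]

print(addressed_directly(mid_sentence))  # False
print(addressed_directly(direct_call))   # True
```

A real system would combine this with the trigger score rather than use it alone, but even a crude pause detector captures the intuition above.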