Non-speech enhancements of voice interaction


Now that speech-to-text has reached parity with human error rates (and that parity will soon extend to more domains of speech recognition), the next step is for machines to understand our context and give us better information. The main ancillary technologies that could make voice interaction much more effective are:

  • User detection
  • Facial emotion detection
  • Gesture recognition

User detection. This can come from many sources, including voice biometrics, but we’ll likely also see it come from companion apps that send our location to a service that unlocks or informs devices when we enter a geofence. It could also come from facial recognition (a growing number of facial recognition APIs now include liveness detection). This could allow a scenario where I walk up to your Echo and can access all of my Alexa Skills or place orders on my account.
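To make the geofence idea concrete, here is a minimal sketch of the check a companion service might run when a phone reports its location. The function names, the coordinates, and the radius are all assumptions for illustration; a real service would also need authentication and protection against spoofed locations.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_geofence(user, fence_center, radius_m):
    """True when the user's (lat, lon) is within radius_m of the fence centre."""
    return haversine_m(*user, *fence_center) <= radius_m


# Hypothetical device location and a reported phone location roughly 111 m north of it.
device = (47.6062, -122.3321)
phone = (47.6072, -122.3321)
print(inside_geofence(phone, device, radius_m=150))
```

The device would only unlock a visitor's account while this check holds, ideally combined with a second signal such as voice biometrics.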

Facial emotion detection. Just as vocal emotion detection can tell an NLU about our emotional state and flag context for particular phrases (e.g. “I’m fine” said in a sad way means the opposite), so can facial emotion detection. Microsoft Cognitive Services allows for this, and many new players are likely to enter this space soon. A voice interaction system could use facial emotion recognition to gauge whether it interpreted a user’s inquiry correctly or whether its response was accurate. The breakthrough this field needs is in sample time. If it can get down to the timescale of micro-expressions, it could surpass even our own ability to detect others’ emotions.
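The “I’m fine” case can be sketched as a simple mismatch check between what was said and the emotion a detector reports. Everything here is an assumption for illustration: the word list, the emotion labels, and the function name are hypothetical, and a real NLU would use a proper sentiment model rather than keyword matching.

```python
# Hypothetical vocabularies; a production system would use trained models.
POSITIVE_WORDS = {"fine", "good", "great", "okay", "ok"}
NEGATIVE_EMOTIONS = {"sad", "angry", "fearful"}


def needs_follow_up(transcript, detected_emotion):
    """Flag utterances whose words sound positive while the detected emotion is negative."""
    words = {w.strip(".,!?\"'").lower() for w in transcript.split()}
    sounds_positive = bool(words & POSITIVE_WORDS)
    return sounds_positive and detected_emotion in NEGATIVE_EMOTIONS


print(needs_follow_up("I'm fine", "sad"))    # words and emotion disagree
print(needs_follow_up("I'm fine", "happy"))  # words and emotion agree
```

A flagged utterance would not be reinterpreted automatically; it would simply give the system a reason to ask a clarifying question.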

Gesture recognition. Nodding, waving hands parallel to the ground, pointing, etc., are all ways we communicate back to people during conversations. These could be used to interrupt a response or to enhance the sensitivity of trigger detection (have you caught yourself waving at the Echo when you wanted to get Alexa’s attention?).
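One way to picture gestures enhancing trigger sensitivity is a wake-word threshold that relaxes when the camera sees a wave. This is only a sketch: the threshold values and the function name are assumptions, not how any shipping device works.

```python
BASE_THRESHOLD = 0.80    # assumed wake-word confidence needed with no gesture
GESTURE_DISCOUNT = 0.20  # assumed amount a detected wave lowers the bar


def should_wake(wake_word_score, gesture_seen):
    """Wake when the audio score clears the threshold, which drops if a gesture was seen."""
    threshold = BASE_THRESHOLD - GESTURE_DISCOUNT if gesture_seen else BASE_THRESHOLD
    return wake_word_score >= threshold


print(should_wake(0.70, gesture_seen=False))  # a mumbled wake word alone
print(should_wake(0.70, gesture_seen=True))   # the same audio plus a wave
```

The same pattern could run in reverse for interruptions: a raised hand while the assistant is speaking could stop playback without any wake word at all.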

The downside is that these technologies require a camera that’s always watching. Maybe a local-only camera could be used for this (such as not streaming video beyond the firewall… just the resulting interpretation)?
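The local-only idea could look something like the sketch below: frames are interpreted on the device, and only a small, whitelisted summary is ever serialized for the network. The `interpret_frame` function is a hypothetical stand-in for an on-device model, and the field names are assumptions.

```python
import json

# Only these interpreted fields may leave the local network (assumed schema).
ALLOWED_FIELDS = {"user", "emotion", "gesture"}


def interpret_frame(frame_bytes):
    """Hypothetical stand-in for an on-device model; returns labels, never pixels."""
    return {"user": "unknown", "emotion": "neutral", "gesture": None}


def outbound_payload(frame_bytes):
    """Build the only message allowed past the firewall: the interpretation as JSON."""
    interpretation = interpret_frame(frame_bytes)
    safe = {k: v for k, v in interpretation.items() if k in ALLOWED_FIELDS}
    return json.dumps(safe)


frame = b"\x00" * 1024  # placeholder for raw camera data that stays on the device
print(outbound_payload(frame))
```

The whitelist matters as much as the local model: even if the model's output grows over time, nothing outside the agreed fields is transmitted.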

If we fast forward to when all of these capabilities are implemented, we may never really feel alone (…but we might also always feel watched).

