Inspired by some of the talks at the Voice Summit today, I've been thinking about how voice technologies might expand in the future to allow more customization in responses. Yes, from voice we can get the text and context, but what about the subtext?
By measuring rate of speech (simple enough to gather from the voice activity duration and the number of words spoken), we can estimate the energy level of the voice. We could also potentially look at the prosody of the speaker to pick up emphasis and tone.
If we’re unable to get emotion from voice, we can get it from vision. There are already APIs for determining emotion from a facial expression: Microsoft Azure offers one, as does Google with the Google Cloud Vision API, and even Amazon’s AWS Rekognition can now do this.
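As one example of what these APIs return, Rekognition's DetectFaces call includes an `Emotions` list (emotion type plus confidence) for each detected face. Below is a sketch of picking the dominant emotion from a response of that shape; the sample payload and helper name are my own, illustrative rather than an actual API call:

```python
def dominant_emotion(detect_faces_response: dict) -> str:
    """Return the highest-confidence emotion for the first face in a
    Rekognition-style DetectFaces response, or UNKNOWN if none found."""
    faces = detect_faces_response.get("FaceDetails", [])
    if not faces or not faces[0].get("Emotions"):
        return "UNKNOWN"
    top = max(faces[0]["Emotions"], key=lambda e: e["Confidence"])
    return top["Type"]

# Illustrative payload mirroring the response shape:
sample = {
    "FaceDetails": [{
        "Emotions": [
            {"Type": "CALM", "Confidence": 22.1},
            {"Type": "HAPPY", "Confidence": 71.4},
            {"Type": "SURPRISED", "Confidence": 6.5},
        ]
    }]
}
print(dominant_emotion(sample))  # HAPPY
```

The other providers expose similar per-face emotion or likelihood fields, so the same selection logic would carry over with different key names.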
Combining these signals can give good insight into what the user is actually trying to say. The next challenge for Skill and Action builders will likely be crafting responses appropriate to the user’s tone.