It’s good to remind ourselves that voice conveys a ton of information that we don’t yet fully take advantage of in today’s voice interactions. Right now we mostly extract the text and, from that, the sentiment, intent, and entities, but there’s much more available:
- Rate of speech
- Word emphasis
- Voice sentiment
- Health information
Many of these fall under the category of prosody. Since prosody is already modifiable on the output side through text to speech, it might be possible to reverse the process: detect the prosody of incoming speech and map it onto the text response with something like SSML tagging.
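As a rough sketch of that idea, the snippet below maps two detected prosody features (speaking rate and pitch deviation) onto an SSML `<prosody>` tag. The function name, feature inputs, and thresholds are illustrative assumptions, not part of any particular speech API; only the `rate` and `pitch` attributes come from the SSML spec.

```python
# Hypothetical sketch: turning detected prosody features into SSML markup.
# The thresholds below are assumptions for illustration, not standards.

def prosody_to_ssml(text, rate_wpm=None, pitch_shift_pct=None):
    """Wrap `text` in an SSML <prosody> tag based on detected features.

    rate_wpm: estimated speaking rate in words per minute.
    pitch_shift_pct: estimated pitch deviation (as an int percent)
        from the speaker's baseline.
    """
    attrs = []
    if rate_wpm is not None:
        # Assumed buckets; ~150 wpm is a typical conversational rate.
        if rate_wpm < 120:
            attrs.append('rate="slow"')
        elif rate_wpm > 180:
            attrs.append('rate="fast"')
    if pitch_shift_pct is not None and abs(pitch_shift_pct) >= 5:
        attrs.append(f'pitch="{pitch_shift_pct:+d}%"')
    if not attrs:
        return text  # nothing notable detected; leave the text plain
    return f'<prosody {" ".join(attrs)}>{text}</prosody>'


# Fast, high-pitched speech gets tagged accordingly:
print(prosody_to_ssml("I can't believe it", rate_wpm=200, pitch_shift_pct=12))
# → <prosody rate="fast" pitch="+12%">I can't believe it</prosody>
```

A real pipeline would of course derive these features from the audio signal itself; the point is that the same vocabulary TTS already uses for output could describe input.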
The bigger question is what we, as voice designers, will do with this information. Will we have the ability to adapt our content based on these richer inputs?