Following up from yesterday’s post on 4.9% WER, the next problem for voice is sending short messages accurately. Why is shortness an issue? With longer utterances, STT algorithms have a better chance of predicting what the next word in a sentence should be, and can use that context to weigh their estimates.
With dictation of SMS, Slack, or other messages, short phrases or responses are not transcribed very accurately, especially when proper nouns come into the mix. If you’re not proofreading, you can end up with some very awkward messages and a response from the recipient that looks like this: “????”.
There are at least two additional ways that STT services could get better accuracy for these dictated messages. The first is to predict the user’s response and use that to bias the recognizer toward speech that matches the prediction. If Google Keyboard on Android can read through my Slack messages or SMSs (if it can’t yet, it would love to), then it could predict my response and use that for better STT.
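To make the biasing idea concrete, here is a minimal sketch of how a predicted reply could rescore an STT engine’s candidate transcriptions. Everything here is invented for illustration: the function names, the hypothesis scores, and the example phrases are assumptions, not any real STT API. The idea is simply that a hypothesis overlapping the contextually predicted reply gets a score boost.

```python
def rescore(hypotheses, predicted_reply, bias_weight=0.5):
    """Boost each (text, score) hypothesis by its word overlap with the
    reply a context model predicts the user is likely to send."""
    predicted_words = set(predicted_reply.lower().split())
    rescored = []
    for text, score in hypotheses:
        words = set(text.lower().split())
        overlap = len(words & predicted_words) / max(len(words), 1)
        rescored.append((text, score + bias_weight * overlap))
    return max(rescored, key=lambda h: h[1])[0]

# Without context, the engine ranks the common-noun reading of a
# proper noun higher (scores are made up for the sketch):
hypotheses = [
    ("see you at the berries", 0.62),
    ("see you at Nari's", 0.58),
]
# A context model that has read the thread might predict this reply:
predicted = "see you at Nari's at 7"
print(rescore(hypotheses, predicted))  # → see you at Nari's
```

Real recognizers do something similar with contextual biasing of their language model during decoding rather than a post-hoc rescore, but the effect is the same: context makes the proper noun win.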
The second way is to let the recipient correct the transcription and use that correction. Right now, proofreading happens on the sender’s side, if at all, and the STT service uses those corrections to improve. But what if the recipient could listen to the audio as well as see the STT output, and then make corrections? This would add a few extra cycles of human-taught correction and also let the recipient hear what the sender had intended.
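A recipient-correction loop like this boils down to a simple data shape: each dictated message carries a reference to its audio alongside the STT text, and a correction turns it into a labeled training pair. The sketch below is hypothetical, with all field and function names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DictatedMessage:
    audio_url: str                        # recipient can play back what the sender said
    stt_text: str                         # what the STT service transcribed
    corrected_text: Optional[str] = None  # filled in if the recipient fixes it

def collect_training_pair(msg):
    """Return an (audio, correct transcript) pair if the recipient actually
    changed the transcription, else None."""
    if msg.corrected_text and msg.corrected_text != msg.stt_text:
        return (msg.audio_url, msg.corrected_text)
    return None

msg = DictatedMessage("audio/123.wav", "meet at the berries")
msg.corrected_text = "meet at Nari's"   # recipient heard the audio and fixed it
print(collect_training_pair(msg))
```

Each corrected pair is exactly the kind of supervised example STT models train on, so every recipient fix could feed the next round of improvement.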
Ultimately, messaging technology should make us better humans. In getting there, it needs to make sure we’re properly understood.