Speech Recognition Reaches Human Parity

Paraphrasing Kevin Kelly (again), predicting the future mostly has to do with understanding the present. Back in May at SpeechTEK, I talked about how speech recognition was going to reach human performance levels within the next year. Apparently, that’s the case today. Two days ago, Microsoft announced its latest results with Cortana: a word error rate of 5.9%, which matches human transcription error rates.

The significance of this is big, not because it puts manual transcription out of business (it won’t, not yet), but because it is another feat of machines encroaching on human cognitive abilities. Vision APIs can already scan photos for Coca-Cola logos or cat faces much faster than we can.

For human proofreading to no longer make sense, the error rate might need to get to 1%. The same announcement reported that last month’s rate was 6.3%. Since improving ASR error rates gets exponentially harder *and* since the technology also improves exponentially, we might be able to assume a linear decline in error at the same rate: 0.4 percentage points per month. Maybe it’ll be a year before we get to 1%? Maybe two years? Let’s be conservative and say five years. By then, today’s level of performance will have spread to other languages, as will improvements in far-field voice modeling. At that point we’re looking at the end of stenography.
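
For anyone who wants to check the back-of-the-envelope math, here is a minimal sketch of that linear extrapolation. It assumes the error rate keeps dropping by the same amount every month, which is a big assumption and not something the announcement claims. The function name `months_to_target` and its parameters are made up for illustration.

```python
def months_to_target(current, previous, target):
    """Months until `target` is reached, assuming the error rate keeps
    falling by the most recent one-month drop (a naive linear model)."""
    monthly_drop = previous - current  # percentage points per month
    if monthly_drop <= 0:
        raise ValueError("error rate is not decreasing")
    return (current - target) / monthly_drop


# Figures from the post: 6.3% last month, 5.9% now, 1% as the target.
months = months_to_target(current=5.9, previous=6.3, target=1.0)
print(f"about {months:.2f} months to reach 1%")
```

With the numbers above this lands at roughly a year, which is the optimistic end of the guesses; the five-year figure is the same calculation with a lot of padding for the gains slowing down.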

