An awesome question came up in Monday’s session at SpeechTEK 2018: how do we measure human parity in areas of AI and speech interaction? I was caught a bit off guard, but it made me think.
For speech transcription, reaching or exceeding human parity is easier to measure: is the Word Error Rate (WER) of the technology lower than that of a human doing the same task? Other criteria can be added, such as the time to complete the task or the number of times someone can listen to a recording. But what about other areas of voice interaction?
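To make the transcription comparison concrete, here is a minimal sketch of WER as it is commonly defined: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the sample sentences are my own illustrations, not taken from any particular toolkit, and real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences, purely for illustration.
    reference = "turn on the kitchen lights"
    human = "turn on the kitchen light"
    machine = "turn on kitchen lights"
    print("human WER:  ", word_error_rate(reference, human))
    print("machine WER:", word_error_rate(reference, machine))
```

Scoring the human transcript and the machine transcript against the same reference gives two numbers that can be compared directly, which is what makes parity relatively easy to define here.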
For Natural Language Understanding, we can devise a test to see if the system is as good as humans at determining the intent and entities of a given command. Maybe we come up with a list of 1000 ways to ask for something and then see how people and machines compare. But it starts to get sketchier…
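As a rough sketch of what that NLU test could look like, the snippet below scores human and machine intent labels against an agreed answer key. Everything in it is assumed for illustration: the `intent_accuracy` helper, the intent names, and the three toy commands standing in for the list of 1000. It also covers intents only; entities would need their own scoring.

```python
from typing import Sequence


def intent_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of commands whose predicted intent matches the answer key."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled command")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical answer key for three phrasings (a real test would use
    # the full list of commands).
    gold = ["play_music", "set_alarm", "play_music"]
    # Hypothetical labels from a human annotator and from an NLU system.
    human_labels = ["play_music", "set_alarm", "set_alarm"]
    machine_labels = ["play_music", "set_alarm", "play_music"]

    print("human accuracy:  ", intent_accuracy(human_labels, gold))
    print("machine accuracy:", intent_accuracy(machine_labels, gold))
```

Note that the answer key itself has to come from people, which is part of why this gets harder to pin down than WER.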
What about emotion detection? What about voice biometrics?
We can probably infer that machines will be able to pick up and decipher audio much better than us. But once they are as good as we are at detecting emotions, can we really tell whether they have become better at it?
Maybe this is a place where we can see the first signs of the Singularity? We won’t really be able to assess whether the machine is doing a better job than us, or just as good a job. One could imagine machines running a huge experiment, testing different responses and gauging the emotional reactions, and then being better able to predict, for example, whether a joke will be funny and to hone it for a specific audience. It’s not clear.