When it comes to voice, one common source of frustration is that the speech recognition service either doesn’t detect the end of speech quickly enough or detects it too quickly. The result is that the device either appears to have very high latency (which we equate with a lack of intelligence) or cuts us off mid-sentence (which we equate with rudeness).
Specifically, when things don’t work, end-of-speech detection is more frustrating than the other endpoints (e.g. end of trigger detection or end of response). When originally testing the Ubi, we saw many different instances of this. Some of the main causes we found were:
- Bad noise isolation. Endpoint detection usually relies on telling quiet periods apart from periods when someone is speaking. If the signal-to-noise ratio (SNR) is low, it is harder for the device to identify when someone has stopped speaking.
- Bad endpoint algorithm. There are many ways of building a voice activity detection (VAD) algorithm. They range from the simple (is someone talking within a certain frequency range and above a given level, and has that signal stopped?) to the complex (relying on machine learning algorithms and neural networks). A poor implementation will lead to the microphone staying on.
- Bad UI design. Certain commands require thought midstream while others don’t. Telling Echo to “Turn on the lights” is fairly straightforward, while asking Google Home to “Play videos of owls on my office TV” might involve a pause to think. Allowing for ums and ahs is important for longer commands.
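
To make the three causes above concrete, here is a minimal sketch of the "simple" end of the spectrum: an energy-based endpoint detector in Python. It is not the Ubi's implementation. The function names, the 20 ms frame length, the 6 dB threshold, and the timeout values are all assumptions chosen for illustration. The detector treats a frame as speech when its level sits a few dB above the noise floor, and declares the end of speech once it has seen enough consecutive quiet frames.

```python
import math

def frame_rms(frame):
    """Root-mean-square level of one frame of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def snr_db(speech_rms, noise_rms):
    """Signal-to-noise ratio in decibels, from two RMS levels."""
    return 20.0 * math.log10(speech_rms / noise_rms)

def find_end_of_speech(frames, noise_rms, frame_ms=20,
                       threshold_db=6.0, hangover_ms=500):
    """Return the index of the frame where the endpoint is declared,
    or None if speech never starts or never ends.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    # A frame counts as speech if it is threshold_db above the noise floor.
    threshold = noise_rms * 10 ** (threshold_db / 20.0)
    # Number of consecutive quiet frames required before we stop listening.
    frames_needed = hangover_ms // frame_ms
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0          # any speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= frames_needed:
                return i
    return None

# Synthetic utterance: speech, a 200 ms "um" pause, more speech, then silence.
speech = [0.5] * 160   # constant-level stand-in for a loud frame
quiet = [0.01] * 160   # constant-level stand-in for background noise
frames = [speech] * 5 + [quiet] * 10 + [speech] * 5 + [quiet] * 40

short_timeout = find_end_of_speech(frames, noise_rms=0.01, hangover_ms=100)
long_timeout = find_end_of_speech(frames, noise_rms=0.01, hangover_ms=500)
print(short_timeout)  # 9: endpoint fires inside the pause, cutting the user off
print(long_timeout)   # 44: endpoint waits until the real end of the utterance
```

With the short timeout the detector stops listening in the middle of the thinking pause, which is the "rude" failure. The longer timeout survives the pause but adds half a second of delay after every command, which is the "slow" failure. A design that chooses the timeout per command type (short for “Turn on the lights”, longer for open-ended requests) is one way to trade these off. Note also that if the noise floor rises toward the speech level, the threshold stops separating the two, which is why a low SNR makes the endpoint hard to find.
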
Focusing on the weakest parts of the voice interaction chain can lead to big wins in user satisfaction.