I noticed something weird on my Echo Show today. It was something I had seen before, and definitely an issue we faced many times with the Ubi and Ubi Kit: failed endpoint detection.
Endpoint detection is when a speech-to-text API client detects that speech is no longer being spoken and marks the end of the utterance for transcription. When endpoint detection fails, a stream of audio to a speech-to-text API could be left open forever.
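To make that concrete, here is a minimal sketch of an energy-based endpointer in Python. It is only an illustration, not how Alexa or any particular STT client does it. The function name `find_endpoint`, the per-frame energy input, and all of the default values (20 ms frames, 800 ms of trailing silence, a hard cap on utterance length) are assumptions I made up for the example.

```python
from typing import Iterable, Tuple


def find_endpoint(
    frame_energies: Iterable[float],
    speech_threshold: float = 0.1,    # assumed: energy at or above this counts as speech
    frame_ms: int = 20,               # assumed frame length
    trailing_silence_ms: int = 800,   # assumed: this much silence after speech ends the utterance
    max_utterance_ms: int = 8000,     # assumed hard cap so the stream cannot stay open forever
) -> Tuple[int, str]:
    """Return (elapsed_ms, reason) for the point where the audio stream should be closed."""
    heard_speech = False
    silence_ms = 0
    elapsed_ms = 0
    for energy in frame_energies:
        elapsed_ms += frame_ms
        if energy >= speech_threshold:
            # Still hearing something loud enough to be speech: reset the silence counter.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Quiet frame after speech: count toward the trailing-silence window.
            silence_ms += frame_ms
            if silence_ms >= trailing_silence_ms:
                return elapsed_ms, "endpoint"
        if elapsed_ms >= max_utterance_ms:
            # Safety net: give up even though no endpoint was detected.
            return elapsed_ms, "timeout"
    return elapsed_ms, "stream_ended"


# One second of speech followed by silence: the endpoint is found normally.
print(find_endpoint([0.5] * 50 + [0.0] * 100))   # (1800, 'endpoint')

# Noise that never drops below the threshold: only the hard cap ends the stream.
print(find_endpoint([0.5] * 1000))               # (8000, 'timeout')
```

The second call is the failure case: if something in the room keeps the energy above the threshold, the silence counter never gets going and only the hard cap closes the stream.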
Most STT engines have some type of timeout. Usually, after 8 seconds, they’ll just treat that point as the end of speech. It seems that a few things have happened recently with Alexa and the first-gen Echo Show that caused the stream to go on much longer:
- The timeout might have been set to something a lot longer (maybe because it was a follow-up question)
- The blue audio indicator line on the Echo Show continued to show a direction of arrival and incoming audio, but only because my daughter was sitting next to the device playing with a toy. The gains in the processing might have been set to amplify audio from the direction of arrival, so the toy noise was interpreted as speech.
Eventually, after maybe 20 seconds (or maybe it just seemed like 20 seconds), the stream failed and I was able to re-trigger and make my request. But it seems like an issue with endpoint detection.
Endpoint detection is very tricky. It gets harder when there is a lot of noise at the start of the voice interaction and the background noise then stops mid-speech, or when the direction of arrival changes (such as a person moving). DSP algorithms for far field might need to account for this so that all the STT API hears is clear speech and the point where it stops.
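One simple way to cope with changing background noise is to track a noise floor and only call a frame "speech" when it stands out well above that floor. Below is a rough sketch using a rolling minimum as the floor. This is a generic technique I'm using for illustration, not what the Echo's far-field DSP actually does, and `speech_flags`, the 50-frame window, and the 3x ratio are all made-up assumptions.

```python
from collections import deque
from typing import Iterable, List


def speech_flags(
    frame_energies: Iterable[float],
    window_frames: int = 50,   # assumed: look back over the last 50 frames
    ratio: float = 3.0,        # assumed: speech must be 3x louder than the noise floor
    min_floor: float = 1e-4,   # assumed: keeps the floor from collapsing to zero
) -> List[bool]:
    """Flag each frame as speech (True) or not, relative to a rolling-minimum noise floor."""
    recent = deque(maxlen=window_frames)
    flags = []
    for energy in frame_energies:
        recent.append(energy)
        # The quietest recent frame is taken as the current background level.
        noise_floor = max(min(recent), min_floor)
        flags.append(energy > ratio * noise_floor)
    return flags


# Noisy start, then speech over the noise, then the room goes quiet.
flags = speech_flags([0.2] * 10 + [1.0] * 20 + [0.01] * 10)
print(flags[:10].count(True), flags[10:30].count(True), flags[30:].count(True))  # 0 20 0

# A steady loud sound (like a toy) after a quiet start: flagged as speech at first,
# but once it fills the whole window it becomes the new noise floor.
toy = speech_flags([0.01] * 5 + [1.0] * 100)
print(toy[53], toy[54])  # True False
```

The trade-off is obvious from the second example: a short window adapts quickly to a noisy toy, but it would also start ignoring a person who talks for a long time without pausing. Picking that window is part of what makes this hard.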