Yesterday, Amazon announced two partnerships for the Alexa team: Kitt.AI (a company the Alexa Fund had invested in) and Sensory, both for providing the Alexa wake up word. This announcement supplies another missing piece of the puzzle for hardware companies that want to integrate Alexa Voice Service (AVS) into their products (essentially, turning them into Echo-like devices).
According to the T&Cs of AVS, an implementation can be push-to-talk, but a hands-free device accessing AVS requires an “Alexa” wake word and far field capability. Until now, however, there was no clear picture of who would provide these. Now it’s official that Sensory or Kitt.AI can be the providers. That said, using either company’s software will likely involve some licensing.
For any far field voice interaction, the chain of services is this:
Audio Conditioning → Wake Up Word → Speech-To-Text → User Intent (NLU or Rules) → Integrations → Text-To-Speech (or some other acknowledgement)
AVS had only covered Speech-To-Text onwards, leaving hardware companies to implement the Wake Up Word on their own. This new announcement means that there is a readily available Alexa trigger word for companies to implement. We have experience with Sensory, having used it for the OK Ubi trigger on our own device. TrulyHandsFree is reliable trigger software to work with (vs. Vocon or Sphinx) and, even more importantly, the company is easy to work with.
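To make the split concrete, here is a minimal Python sketch of the chain above, assuming (as described here) that AVS picks up at Speech-To-Text. The `Stage` enum and the two helper functions are hypothetical names used for illustration only; they are not part of AVS or of any vendor SDK.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of a far field voice interaction, in processing order."""
    AUDIO_CONDITIONING = 1
    WAKE_WORD = 2
    SPEECH_TO_TEXT = 3
    USER_INTENT = 4
    INTEGRATIONS = 5
    TEXT_TO_SPEECH = 6


# Assumption taken from the text: AVS covers Speech-To-Text onwards.
AVS_FIRST_STAGE = Stage.SPEECH_TO_TEXT


def device_side_stages():
    """Stages the hardware company still has to source or build."""
    return [s for s in Stage if s < AVS_FIRST_STAGE]


def avs_side_stages():
    """Stages handled by AVS once audio is streamed to it."""
    return [s for s in Stage if s >= AVS_FIRST_STAGE]


if __name__ == "__main__":
    print("Device:", [s.name for s in device_side_stages()])
    print("AVS:   ", [s.name for s in avs_side_stages()])
```

Read this way, the wake word partnerships fill one of the two device-side stages, and audio conditioning is the one that remains.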
The last missing piece, which will likely be filled soon, is the audio conditioning component. There are a number of companies that have far field DSPs and it’s likely that Amazon will provide some guidance on this at some point.
The other part of the Alexa announcement was cloud endpointing. Previously, those who implemented AVS hands-free (we’ve done this on a few different devices) needed to set up their own endpointing, that is, detecting the end of a spoken command. Now, AVS will be able to signal when it detects the end of speech and then process the command. This takes a big burden off the developer. We had seen that end-of-speech issues could lead the mic to continue streaming audio to AVS, especially when there was a lot of background noise.
It’s going to become a lot easier to implement AVS on hardware products.
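For a sense of what device-side endpointing involves, here is a minimal Python sketch of a simple energy-based detector. It is an illustration under stated assumptions, not how AVS or any vendor does it: the function names, the fixed RMS threshold, and the frame counts are all hypothetical, and production endpointers are considerably more sophisticated.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_end_of_speech(frames, threshold=500.0,
                       trailing_silence_frames=3, max_frames=None):
    """Return the index of the frame at which to stop streaming.

    Speech is considered finished once `trailing_silence_frames`
    consecutive frames fall below `threshold` after some speech was
    heard. `max_frames` is an optional hard cap. Returns None if
    neither condition is met (the mic would keep streaming).
    """
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if max_frames is not None and i >= max_frames:
            return i  # hard cap reached: stop regardless of level
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= trailing_silence_frames:
                return i
    return None


if __name__ == "__main__":
    speech, silence, noise = [1000] * 160, [0] * 160, [600] * 160
    quiet_room = [silence, speech, speech] + [silence] * 4
    noisy_room = [speech, speech] + [noise] * 4
    print(find_end_of_speech(quiet_room))
    print(find_end_of_speech(noisy_room))
    print(find_end_of_speech(noisy_room, max_frames=4))
```

In the noisy case the background level never drops below the fixed threshold, so the detector never fires unless a cap is set. That is the kind of failure that keeps the mic streaming, and part of why moving this decision to the cloud is a relief.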