Even as more cloud services for voice (speech to text, NLU, text to speech) become available, Internet service gets faster, and new router technology makes WiFi more stable, there has been equally explosive growth in embedded solutions. These include phrase spotting software, ASR engines, dialogue management, sound analytics, emotion detection, and far-field audio conditioning.
The result is that two peaks are forming, local and cloud, and they need a reliable bridge. About a year ago, Phil Lewer wrote about the hybrid approach to building voice interaction. We’ve spoken with many hardware companies that are becoming confused by the myriad embedded solutions that are available now, or soon will be, to enable ambient voice interaction. The technology that has come out has produced a two-orders-of-magnitude increase in the number of options available to the device maker.
Companies often have to take the following considerations into account:
How is the device powered? (battery, plug, or both)
What type of interaction is it? (single turn, multiple turns)
Push to talk or hands-free?
Near or far field?
AVS, Google Assistant, something else, or multiple services?
What is the desired price point?
BT, WiFi, other connectivity, multiple sources, or none?
What type of environment will the device be used in?
The more clarity a device maker has about how they want their product to function, the easier the decisions become on which components and local services to employ. Where services or benefits overlap, the tie-breaker can come down to benchmarking the candidates in a prototype of the product.
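One way to make the checklist above concrete is to write the answers down as a small data structure and derive the local building blocks from it. The Python sketch below is only an illustration: the `DeviceRequirements` fields, the rules in `local_components`, and the names `engine_a` and `engine_b` are all assumptions made for this example, not a recommendation or any vendor's API. The benchmark scores are placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRequirements:
    """Hypothetical record of the answers to the questions listed above."""
    power: str              # "battery", "plug", or "both"
    multi_turn: bool        # False means single-turn interaction
    hands_free: bool        # False means push to talk
    far_field: bool         # False means near field
    voice_services: tuple   # e.g. ("AVS",) or ("AVS", "Google Assistant")
    target_price_usd: float
    connectivity: tuple     # e.g. ("WiFi", "BT"); an empty tuple means none
    environment: str        # free-text description, e.g. "living room"


def local_components(req: DeviceRequirements) -> list:
    """Map requirements to embedded building blocks (illustrative rules only)."""
    needed = []
    if req.hands_free:
        # No button to press, so the device itself listens for a phrase.
        needed.append("phrase spotting")
    if req.far_field:
        needed.append("far-field audio conditioning")
    if not req.connectivity:
        # With no link to the cloud, recognition has to run on the device.
        needed.append("embedded ASR")
        if req.multi_turn:
            needed.append("dialogue management")
    return needed


def break_tie(benchmark_scores: dict) -> str:
    """Return the candidate with the highest score from prototype benchmarks."""
    return max(benchmark_scores, key=benchmark_scores.get)


speaker = DeviceRequirements(
    power="plug", multi_turn=True, hands_free=True, far_field=True,
    voice_services=("AVS",), target_price_usd=99.0,
    connectivity=("WiFi",), environment="living room",
)
print(local_components(speaker))
print(break_tie({"engine_a": 0.91, "engine_b": 0.94}))  # placeholder scores
```

Writing the answers down this way forces each question to get an explicit answer before components are chosen, and it keeps the tie-breaking step tied to numbers measured in the prototype rather than to data sheets.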