With millions of ambient voice interactive devices shipped, companies are collecting data for machine learning that simply wasn't available before. In particular, far field voice samples weren't available at scale. Sure, Google could have isolated samples of OK Google / Google Now requests captured through phones, but those would make up only a fraction of a fraction of requests (though still potentially millions per month). Amazon, by contrast, can easily send both raw and processed audio from Echos to its ASR engines. By comparing the two across millions of daily requests, Amazon can build a far field model, and judging by the modes available publicly through AVS, that seems to have been the intention all along.
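The kind of parallel data this produces can also be approximated synthetically. A common technique is to convolve clean near-field audio with a room impulse response to simulate the far-field version, yielding (clean, far-field) training pairs. Here's a minimal toy sketch of that idea; the signal values and impulse response are invented for illustration, and real pipelines use measured room responses:

```python
# Toy sketch: simulate a far-field signal from a clean near-field one by
# convolving with a room impulse response (direct path plus one delayed,
# attenuated echo). The (clean, far) pair is the kind of parallel example
# a far-field acoustic model can be trained on.

def simulate_far_field(clean, rir):
    """Direct-form convolution of a clean signal with an impulse response."""
    out = [0.0] * (len(clean) + len(rir) - 1)
    for i, s in enumerate(clean):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

clean = [1.0, 0.5, 0.25]   # toy near-field samples
rir = [1.0, 0.0, 0.6]      # direct path, then an echo two samples later
far = simulate_far_field(clean, rir)
# (clean, far) is one synthetic training pair
```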
The benefit of such a model is that it drastically reduces the upfront audio conditioning required on the device. That means a lower-power DSP and, as a result, a lower BOM cost for any hardware maker looking to incorporate voice, making it more attractive to add voice as a product feature.
There are still things that will be very difficult to do in a cloud-based model. Acoustic Echo Cancellation operates on raw samples with very tight timing requirements, so it's unlikely to move to the cloud, at least in the short term. Likewise, trigger word detection will probably stay local: people just aren't likely to accept a microphone signal streaming constantly from their home to the cloud.
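The local trigger word acts as a gate: audio sits in a small on-device buffer and nothing leaves the device until the detector fires, after which the buffered context plus subsequent frames are streamed. A minimal sketch of that gating logic, with a toy stand-in detector (string matching on labeled frames, not a real keyword spotter):

```python
# Sketch of on-device trigger gating: frames stay in a small local ring
# buffer and are only yielded for cloud upload once the trigger is heard.
# Frames here are labeled strings purely for illustration.

from collections import deque

TRIGGER = "alexa"  # placeholder for a real local keyword detection

def gated_stream(frames, pre_roll=2):
    """Yield frames for upload only after the trigger fires,
    including a short pre-roll from the local buffer."""
    buffer = deque(maxlen=pre_roll)
    triggered = False
    for frame in frames:
        if triggered:
            yield frame
        elif frame == TRIGGER:
            triggered = True
            yield from buffer  # flush buffered context
            yield frame
        else:
            buffer.append(frame)  # stays local, never uploaded

frames = ["noise", "tv", "music", "alexa", "turn", "on", "lights"]
uploaded = list(gated_stream(frames))
# only audio around and after the trigger leaves the device
```

Everything before the pre-roll window (here, "noise") is simply discarded on-device, which is the privacy property users expect.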
The real benefit of far field ASR will come once one of the players in the field releases a Speech-to-Text API with far field capability. That would enable the creation of many low cost ambient voice interactions and give hardware makers the freedom to design the interaction however they want.
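No such API exists in the post; as a purely hypothetical illustration, the key difference from today's STT APIs might be as small as a flag telling the service to apply its own far-field acoustic model rather than expecting DSP-conditioned input. All names and fields below are invented:

```python
# Purely hypothetical sketch of a far-field-capable STT request. The
# endpoint, "mic_profile" parameter, and field names are invented here
# for illustration; no real service is being described.

def build_request(audio_bytes, mic_profile="far_field"):
    """Assemble a hypothetical STT request asking the service to apply
    its far-field model instead of expecting pre-conditioned audio."""
    return {
        "config": {
            "mic_profile": mic_profile,  # invented parameter
            "sample_rate_hz": 16000,
            "language": "en-US",
        },
        "audio_len": len(audio_bytes),
    }

req = build_request(b"\x00" * 320)  # 10 ms of 16 kHz, 16-bit mono audio
```

The point is that the heavy lifting moves server-side, so the hardware maker's integration surface stays tiny.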