Alireza Kenarsari wrote a great piece about his company’s new product, Porcupine, that compares different wake word engines. I’ve approached new wake word engines with a bit of skepticism over the past few years, as I’ve been jaded by our experiences collecting audio samples and then feeding them to an engine for training.
We had one experience where we were trying to train an engine (not Porcupine, but I won’t name it) to tolerate multiple English accents, and what we found was that at some point, the more training samples we provided, the less reliable the engine became. It was very frustrating. We would then try to strike a balance between an engine that rejected every request and one that triggered correctly… but also triggered whenever it heard the wind blow.
We had a similar experience with grammar-based (or “phoneme-based”) engines. These convert text into a trigger. We would come up with all sorts of interesting additional pronunciations to reduce the false rejection rate, but you’d still have to treat the device like you were speaking to someone who couldn’t understand you well… speaking loudly and slowly. It was very unnatural.
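To make the trade-off concrete, here is a minimal sketch of how a grammar/phoneme-based engine might accept several pronunciations of one wake word. The phoneme strings and the matching logic are illustrative assumptions, not any real engine’s API:

```python
# Hypothetical pronunciation grammar for a wake word. Each variant we add
# lowers the false-rejection rate, but also widens the net the engine
# casts -- and with it, the false-acceptance rate.
PRONUNCIATIONS = {
    "porcupine": [
        "P AO R K Y AH P AY N",   # ARPAbet-style baseline pronunciation
        "P AO R K Y UH P AY N",   # vowel variant for other accents
        "P AO K Y AH P AY N",     # dropped 'r' for non-rhotic accents
    ],
}

def matches_wake_word(decoded_phonemes: str, word: str = "porcupine") -> bool:
    """Return True if the recognizer's phoneme output matches any variant."""
    return decoded_phonemes in PRONUNCIATIONS.get(word, [])
```

Every extra variant is another chance for background noise to decode into something the grammar accepts, which is exactly the balancing act described above.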
With growing concerns about privacy and security, there’s a growing role for local, on-device voice interaction. There’s also a need to develop a backup interaction mode for when the device can’t connect to the cloud.
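One way to picture that backup mode is a cloud-first dispatcher that falls back to a small on-device grammar when the network is down. This is a hypothetical sketch; the handler names and the command set are assumptions, not taken from any particular product:

```python
# Assumed on-device grammar: the handful of commands a local engine
# can recognize without any cloud connection.
LOCAL_COMMANDS = {"lights on", "lights off", "stop"}

def handle_utterance(text: str, cloud_available: bool) -> str:
    """Route an utterance to full cloud NLU, or fall back to local commands."""
    if cloud_available:
        return f"cloud: {text}"       # full natural-language handling
    if text in LOCAL_COMMANDS:
        return f"local: {text}"       # limited, fully offline handling
    return "unavailable"              # politely decline everything else
```

The design choice is that the device stays useful offline for a small, predictable set of commands rather than failing entirely.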
Let’s hope that competition in this arena leads to even better solutions with more features and at a lower cost.