Recently, I was at an outdoor event with a lot of crowd and speaker noise, and I got to thinking about making quick searches on the phone. With two kids in tow, I wanted to check the hours for the next place we’d be visiting and look up some information on a protest that we had spotted close by.
Either I’m getting older or there is a genuine issue: I am horrible at searching for things quickly on the phone. First it’s opening the lock screen by pressing the home button, gazing like a monkey at my phone waiting for facial recognition or being told my teething-cracker-crusted finger can’t be recognized, then pressing the mic icon for Google Assistant, or swiping clumsily on the keyboard and needing to correct my typing.
Also, I like to keep a clean home screen (my young daughter is great at finding the YouTube app), so if I’m searching Twitter for micro-local news (like the protest), I first have to search for the app, again with swipe typing or the voice keyboard. That adds about 5 seconds (push mic, trigger tone, dictation, waiting for results), or longer if the speech-to-text is incorrect.
I’ve been reluctant to use Bixby or OK Google triggers from the lock screen because of previous inconsistent interactions. When I’m outside on a task, I get only a few moments to perform a search. The trade-off is this: if the trigger works, I potentially save 30% of the interaction time; if it fails, I have to revert to the original manual search, increasing the query time by 50% (all anecdotal). It’s also 50–50 whether it works. Since the risk-reward is not balanced most of the time, unless the corrective query can be done later, I’ve started to rely more on manual search.
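To put rough numbers on that hunch, here is a small back-of-the-envelope sketch. It only uses the anecdotal figures above (30% saved on success, 50% added on failure, a 50–50 success rate). The function names are made up for illustration, and it assumes a failed attempt always falls back to a full manual search.

```python
def expected_relative_time(p_success, saving=0.30, penalty=0.50):
    """Expected query time relative to a manual search (1.0 = same as manual)."""
    on_success = 1 - saving   # voice trigger worked: shorter interaction
    on_failure = 1 + penalty  # voice trigger failed: retry manually, longer overall
    return p_success * on_success + (1 - p_success) * on_failure


def break_even_success_rate(saving=0.30, penalty=0.50):
    """Success rate at which voice and manual search take equally long on average."""
    return penalty / (saving + penalty)


print(round(expected_relative_time(0.5), 3))  # 1.1
print(round(break_even_success_rate(), 3))    # 0.625
```

With these numbers, a 50–50 trigger is about 10% slower than manual search on average, and it would need to work roughly 62.5% of the time just to break even.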
What compounds the issue is that mobile searches are usually made when people are exposed to outside noise and crowds around them. Apple controls all aspects of the audio processing if the search is done through the on-device mic: they know the iPhone’s and iPad’s DSPs, they control the STT, and they can adjust accordingly. Google has the same advantage on Google phones, but all other devices start as a black box of audio processing. Samsung, Huawei, HTC, LG, etc. (and even iOS devices) are all doing their own DSP before sending that audio to the Google Assistant. Yes, Google Assistant likely has enough data to build audio models for each phone, but it does affect the processing.
For all devices, you can also add to this the complexity of whether a wired or Bluetooth headset is used for the audio capture. And the noise level.
What would be interesting is for Google or Apple (or Samsung with Bixby) to randomly sample the audio conditions of an area, process the noise, and create a filter for speech recognition. How might this work?
- Android devices sample audio at regular intervals
- Quality of the audio is abstracted locally
- Abstracted data is sent to cloud service
- Location-based filter is created
- Speech samples from that location can then be fed through that filter for as long as it still holds
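As a rough illustration of the steps above, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the names `summarize_locally` and `LocationNoiseService`, the use of a single RMS level in dBFS as the "abstracted" data, the string cell IDs, and the 10-minute expiry. A real filter would need far more than an average noise level (a spectral noise profile, for instance), so treat this as the shape of the idea rather than an implementation.

```python
import math
import time
from collections import defaultdict


def summarize_locally(samples):
    """Reduce raw PCM samples (floats in [-1, 1]) to one noise level in dBFS.

    Only this number would leave the device, not the audio itself.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9))  # clamp to avoid log10(0) on silence


class LocationNoiseService:
    """Hypothetical cloud side: keeps recent reports per coarse location cell."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.reports = defaultdict(list)  # cell -> [(timestamp, level_db), ...]

    def report(self, cell, level_db, now=None):
        now = time.time() if now is None else now
        self.reports[cell].append((now, level_db))

    def current_filter(self, cell, now=None):
        """Average the fresh reports for a cell; None if nothing recent exists."""
        now = time.time() if now is None else now
        fresh = [lvl for ts, lvl in self.reports.get(cell, []) if now - ts <= self.ttl]
        if not fresh:
            return None
        return {"noise_floor_db": sum(fresh) / len(fresh), "n_reports": len(fresh)}


service = LocationNoiseService(ttl_seconds=600)
service.report("cell-a", summarize_locally([0.1, -0.1, 0.1, -0.1]), now=0)
print(service.current_filter("cell-a", now=60))    # one fresh report
print(service.current_filter("cell-a", now=3600))  # None: the report has expired
```

The expiry is the part that matters most here: a crowd disperses or a band stops playing, so a location filter should only be trusted for a short window.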
Data for this could include:
- Ambient noise level
- Type of noise (crowd, natural, mechanical, etc.)
- Reverberation level
- Temperature, humidity, and air pressure to help build the model
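For a sense of how small such a report could be, here is a sketch of the payload as a Python dataclass. The class name, field names, and units are all assumptions for illustration (for example, expressing reverberation as an RT60 time in seconds); none of this reflects an existing Google, Apple, or Samsung API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AmbientReport:
    """Hypothetical per-sample payload a phone might send instead of raw audio."""
    cell_id: str                 # coarse location bucket, not exact coordinates
    noise_level_db: float        # ambient noise level
    noise_type: str              # e.g. "crowd", "natural", "mechanical"
    reverberation_rt60_s: float  # assumed measure: reverberation time in seconds
    temperature_c: float
    humidity_pct: float
    pressure_hpa: float

    def to_json(self):
        return json.dumps(asdict(self))


report = AmbientReport("cell-a", -20.0, "crowd", 0.4, 24.0, 55.0, 1013.0)
print(report.to_json())
```

A handful of numbers and a label per sample keeps the upload tiny and avoids sending any recorded speech off the device.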
Maybe this could help make speaking in a crowd a little more accurate and change the risk-reward balance for using speech on mobile?