Titan Note, an Indiegogo project, has released a device that it hopes will let us focus on the content of meetings and lectures rather than on capturing material. The device listens to the speech around it and transcribes it for the user.
You can check them out here:
When brainstorming about the Ubi many years back, we had the idea that eventually mics would always be listening to what we said, and that when we left meetings, all of the action items would already be in our email and calendars. We’d just have to sit, talk, decide, and work.
Getting to this reality is fairly difficult and I definitely applaud Titan Note for taking steps to get there. There are a few issues that need to be addressed to get to this point:
Speech-to-text. Even in ideal settings (a trained system with a noise-cancelling headset), automatic transcription makes errors. Human transcribers make errors too, and computers have now reportedly surpassed human accuracy on some benchmarks.
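Transcription accuracy is usually measured as word error rate (WER): the word-level edit distance between the reference and the hypothesis, divided by the reference length. A minimal sketch (the function name and example strings are my own, for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("focus on the content", "focus on the context"))  # 0.25
```

One substituted word out of four gives a 25% WER — and note how a single-character difference ("content" vs. "context") counts as a full word error, which is why even small acoustic mistakes add up quickly.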
Far field. Lecture halls and meeting rooms present all kinds of issues, from a quiet speaker far from the microphone to other people chatting around the recorder. Every room has its own acoustic profile that can affect speech recognition, and reverb and noise have a deep impact on speech engine results.
Diarisation. This is the process of attributing transcribed speech to different speakers. If multiple people make comments or ask questions in a meeting or lecture, the system needs to detect that a different speaker is talking and assign that portion of the transcript accordingly.
Identification. On top of diarisation, speaker identification is needed to attach a name to each speaker. This requires speech samples collected in advance in order to assign the right label.
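One common approach to diarisation is to embed each speech segment as a vector and cluster the vectors, so segments from the same voice land in the same cluster. A toy sketch of greedy online clustering (real systems use learned speaker embeddings such as i-vectors or neural embeddings; the 2-D vectors and threshold here are made up for illustration):

```python
import math

def diarize(segment_embeddings, threshold=0.5):
    """Greedy online diarization: each segment joins the nearest known
    speaker centroid, or starts a new speaker if none is close enough."""
    centroids = []  # running mean embedding per discovered speaker
    counts = []     # how many segments each speaker has so far
    labels = []
    for emb in segment_embeddings:
        best, best_dist = None, threshold
        for idx, c in enumerate(centroids):
            dist = math.dist(emb, c)
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is None:
            # No existing speaker is close enough: start a new one.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Assign to the nearest speaker and update its running mean.
            counts[best] += 1
            centroids[best] = [(c * (counts[best] - 1) + e) / counts[best]
                               for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

segments = [(0.1, 0.0), (0.12, 0.05), (2.0, 2.1), (0.08, 0.02), (2.1, 2.0)]
print(diarize(segments))  # [0, 0, 1, 0, 1]
```

Identification is then the extra step of matching each discovered cluster against enrolled voice samples to replace "speaker 1" with an actual name.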
Context extraction. What are the important points to remember? What are the action items? Dates? To-dos?
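A crude first pass at context extraction can be done with pattern matching: flag sentences that sound like commitments and pull out any date mentions. The cue words and patterns below are my own simplification; production systems use trained information-extraction models rather than regexes:

```python
import re

# Words that often signal a commitment or task (hand-picked, illustrative).
ACTION_CUES = re.compile(r"\b(will|should|need to|must|let's|action item)\b",
                         re.IGNORECASE)
# Very rough date mentions: weekdays, relative dates, or M/D numerics.
DATE_PATTERN = re.compile(r"\b(?:by |on |before )?"
                          r"(Monday|Tuesday|Wednesday|Thursday|Friday|"
                          r"tomorrow|next week|\d{1,2}/\d{1,2})\b",
                          re.IGNORECASE)

def extract_action_items(transcript):
    """Return sentences that look like action items, with any dates found."""
    items = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript):
        if ACTION_CUES.search(sentence):
            items.append({"text": sentence.strip(),
                          "dates": DATE_PATTERN.findall(sentence)})
    return items

notes = ("The quarter looked strong. Maria will send the revised budget "
         "by Friday. We need to book the venue before 6/15.")
for item in extract_action_items(notes):
    print(item)
```

Even this naive version shows the shape of the problem: deciding what counts as an action item is a judgment call that keyword matching only approximates.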
Summarization. Even for humans, it takes real skill to pull the most salient points together into a synopsis of the content for a TL;DR-style summary.
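The simplest machine approach is extractive: score each sentence by how frequent its content words are in the whole text, then keep the top scorers in their original order. A minimal sketch (the stopword list is an ad-hoc sample, not a standard one):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "it",
             "that", "we", "for", "on", "this", "be", "are"}

def summarize(text, n_sentences=2):
    """Naive extractive summary: rank sentences by the average frequency
    of their non-stopword terms, return the top ones in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        terms = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in STOPWORDS]
        return sum(freq[t] for t in terms) / (len(terms) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Re-emit in the order the sentences originally appeared.
    return " ".join(s for s in sentences if s in ranked)
```

This kind of frequency scoring surfaces repeated topics but has no notion of what actually mattered in the meeting — which is exactly why summarization remains the hard, human-skill end of this pipeline.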
The last issue for this technology, at least for a partial implementation today, is cost. Human-assisted transcription can run into the hundreds of dollars per hour, and even using speech engines for transcription incurs either increased hosting costs or per-use service fees.
In the end, this technology will happen and become ubiquitous. Amazon or Google may be the players who make it feasible, because they have a revenue model (selling stuff) that can be made more effective by extracting information from user content (as Gmail does).
Hat tip to Ron for the heads up on Titan Note.