A few days ago, Microsoft announced VALL-E, a new text-to-speech technology that can mimic a person’s voice with only three seconds of sample data. This comes at a time when Microsoft is contemplating a much larger stake in OpenAI.
At the Toronto Tech Summit in 2019, someone asked the panel I was on what our biggest concern was about where privacy and technology were heading. My answer was the combination of natural language generation with speech synthesis that is indistinguishable from the real thing. We’re getting there.
Before, it was Tacotron 2, WaveNet, and Project Voco that pushed the boundaries. Then Lyrebird came along to democratize the use of these tools. Microsoft is again making a big move in this space.
Soon, it might be possible for an AI to sound like you in both voice and substance. There are some incredibly dangerous uses of this technology that boggle the mind (think about Sarah Connor calling to John), but maybe also ones that can leap us ahead, if we’re willing to give up some autonomy over our voice and words.
The combination of NLG and realistic TTS means we won’t be able to tell what is real from what is not. It might bring back loved ones or help provide us with some form of immortality. It might also help us scale ourselves to do more with our limited time.