Cowabunga! Speechcons and Speech Synthesis

There’s something very jarring about the transition between Alexa’s speechcons (interjections) and its text-to-speech. Quick refresh: speechcons are recordings of the text-to-speech speaker’s voice, not a combination of sounds stitched together to make something that sounds like a word. “Cowabunga” and “awesome” are examples. They’re very expressive because they’re real.

On the other hand, Alexa’s text-to-speech, while positive in tone, is fairly flat emotionally. There’s also a bit of staccato in the rhythm of the speech. It’s really awful to hear anything spoken at length in this voice. In comparison, Tacotron 2 speech, at least in the published samples, seems much more tolerable for longer listening.

Things get worse when you play a very realistic, expressive speechcon and then immediately follow it with the Alexa TTS. The contrast makes the TTS sound even flatter than it does on its own.

So what’s the takeaway? If you’re building a Skill, consider having a voice actor record your text instead of relying on TTS. If you use speechcons, keep them apart from the TTS, maybe separating the two with some music.
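
To make that last suggestion concrete, here is a minimal sketch in Python of an SSML response that plays a speechcon, then a short music clip, then the regular TTS. The `say-as interpret-as="interjection"` and `audio` tags are standard Alexa SSML; the `build_ssml` helper, the music URL, and the sample text are hypothetical and only there for illustration. Alexa places constraints on audio files (format, length, HTTPS hosting), so check the SSML reference before using a real clip.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(speechcon: str, music_url: str, body: str) -> str:
    """Return SSML that plays a speechcon, then a music clip, then plain TTS.

    build_ssml is a hypothetical helper, not part of any Alexa SDK.
    """
    return (
        "<speak>"
        # The speechcon: a real recording, so it sounds expressive.
        f'<say-as interpret-as="interjection">{escape(speechcon)}</say-as>'
        # A short clip acts as a buffer so the flat TTS does not
        # land directly after the expressive recording.
        f"<audio src={quoteattr(music_url)}/>"
        # Regular text-to-speech, escaped so characters like & stay valid XML.
        f"{escape(body)}"
        "</speak>"
    )


# Placeholder URL; swap in a clip you host yourself.
ssml = build_ssml(
    "cowabunga",
    "https://example.com/stinger.mp3",
    "You got 5 out of 5 & unlocked the next round.",
)
print(ssml)
```

The resulting string can be returned as the SSML output speech of a Skill response. Whether a music sting or a plain pause works better as the separator is a matter of taste, so it is worth listening to both.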
