There’s something very jarring about the transition between Alexa’s speechcons (interjections) and its text-to-speech. Quick refresher: speechcons are recordings of the text-to-speech speaker’s own voice, not a combination of sounds assembled to resemble a word. “Cowabunga” and “awesome” are examples. They’re very expressive because they’re real.

On the other hand, Alexa’s text-to-speech, while positive in tone, is fairly flat emotionally. There’s also a bit of staccato in the rhythm of the speech. It’s really awful to hear anything spoken at length in this voice. In comparison, Tacotron 2 speech, at least in the samples, seems much more tolerable for longer listening.

Things get worse when you play a very realistic, expressive speechcon and then immediately follow it with the Alexa TTS. The contrast makes the TTS sound even flatter than it does on its own.

So what’s the takeaway? If you’re building a Skill, consider having a voice actor read out your text. If you use speechcons, keep them apart from the TTS, perhaps by separating them with some music.
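
To make the last suggestion concrete, here is a minimal sketch of how the separation might look in a Skill response. It builds an SSML string that plays a speechcon, then a short music clip, then the regular TTS. The `say-as interpret-as="interjection"` and `audio` tags are standard Alexa SSML; the function name `build_ssml` and the example URL are hypothetical, and the clip itself would need to meet Alexa's audio requirements.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(speechcon: str, bridge_audio_url: str, body_text: str) -> str:
    """Return SSML that plays a speechcon, a short audio bridge, then TTS.

    The function name and parameters are illustrative, not part of any SDK.
    """
    parts = [
        "<speak>",
        # Speechcon: Alexa plays its recorded interjection for this word.
        f'<say-as interpret-as="interjection">{escape(speechcon)}</say-as>',
        # Musical bridge so the recording and the TTS are never back to back.
        f"<audio src={quoteattr(bridge_audio_url)}/>",
        # Ordinary text-to-speech, escaped so characters like & stay valid.
        escape(body_text),
        "</speak>",
    ]
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical clip URL, used only for illustration.
    print(build_ssml(
        "cowabunga",
        "https://example.com/bridge.mp3",
        "Here are today's surf conditions.",
    ))
```

The same `audio` tag is also one way to play a voice actor's recording in place of the TTS: host the recorded line as a clip and drop the plain-text part entirely.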

