Using Generative AI for Marking Up Speech Synthesis

Leor Grebler
2 min readApr 17, 2023
Generated by author using Midjourney

One advance I was so excited about with Alexa Skills many years ago was the introduction of Speech Synthesis Markup Language (SSML) tags. With these tags, you could change the prosody, speed, and other characteristics of the speech in your Alexa Skill. Now, instead of being a static robotic sounding voice, Alexa could whisper or yell or speak more naturally.

However, adding SSML to text is time consuming and requires one to tailor a response. It also means analyzing text to understand the desired sentiment to communicate and then adding the SSML tags are the right parts of speech. If only there were a better, more scalable way!

Wait no more! It turns out ChatGPT is great for this type of task. All you need to do is first prime the response explaining what you’d like to do and then copy the text you want to markup for speech synthesis. Here’s a quick example:

Initial prompt:

I'm going to provide you with a short passage. I'd like you to update it with SSML tags based on an excited tone. Can you just acknowledge that you understand?

Then, just put in the text. Here’s an example of an output:

<prosody rate="fast" pitch="high">
<emphasis level="strong">Back in 2005,</emphasis> I was trying to finish my engineering degree. I was limping along, living at home, and realizing that I was missing out on generating income as a university graduate, trying to find creative ways to supplement my income while finishing my studies.
I tried all sorts of different things like buying lots on eBay and splitting them into smaller units and reselling. Many designer ties were sold in such a fashion. I tried to scale getting paid to surf using an oscillating fan to mimic screen activity. I had even explored walnut farming.

<emphasis level="strong">The most promising affiliate program at the time</emphasis> was Google’s offer to pay to encourage people to download Firefox with the Google Toolbar plugin. This was what could be considered a second browser war. MS Internet Explorer was still a big player and MSN Search had been coming out.

Google added a search toolbar to ensure that they wouldn’t lose market share of search and ad revenue. They were open to paying <emphasis level="strong">$1 for every referral,</emphasis> which was a program similar to AdSense.
</prosody>

You can ask it to re-do portions or change the tone, but this is a much more scalable way of adding more life to voice responses.

--

--

Leor Grebler

Independent daily thoughts on all things future, voice technologies and AI. More at http://linkedin.com/in/grebler