Leor Grebler
1 min read · Apr 28, 2021


This is incredibly important work. Being able to apply SSML automatically based on context will help close the gap in speech synthesis. While the tools are in place, you found the biggest issue: applying SSML to make spoken text sound realistic takes a lot of manual effort.

It was also great that you noted another challenge: if you apply SSML wholesale based on rules around detected sentiment, you'll end up with a jarring experience. Narrated books don't jump from happy to sad.
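To make the failure mode concrete, here is a minimal sketch of the naive approach: each sentence is scored independently and wrapped in its own SSML `<prosody>` element. The keyword-based scorer and the prosody values are my own illustrative stand-ins (a real pipeline would use a sentiment model), but the structure shows why adjacent sentences can lurch between emotional registers with no transition.

```python
# Toy sketch of per-sentence sentiment -> SSML prosody mapping.
# The keyword scorer below is a stand-in for a real sentiment model;
# the prosody values are arbitrary illustrative choices.

HAPPY_WORDS = {"delighted", "wonderful", "joy"}
SAD_WORDS = {"tears", "grief", "mourning"}

def sentiment(sentence: str) -> str:
    """Crude keyword-based sentiment label for one sentence."""
    words = set(sentence.lower().replace(".", "").replace(",", "").split())
    if words & HAPPY_WORDS:
        return "happy"
    if words & SAD_WORDS:
        return "sad"
    return "neutral"

# One prosody template per detected sentiment.
PROSODY = {
    "happy": '<prosody rate="110%" pitch="+15%">{}</prosody>',
    "sad": '<prosody rate="85%" pitch="-15%">{}</prosody>',
    "neutral": "{}",
}

def naive_ssml(sentences):
    """Blanket application: every sentence gets its own prosody tag,
    so consecutive sentences can jump abruptly between emotions."""
    body = " ".join(PROSODY[sentiment(s)].format(s) for s in sentences)
    return "<speak>" + body + "</speak>"

sentences = [
    "She was delighted by the wonderful news.",
    "Then the tears came without warning.",
]
print(naive_ssml(sentences))
```

Running this produces a markup where the voice flips from fast-and-bright to slow-and-low between back-to-back sentences, which is exactly the jarring effect described above; a production system would need to smooth or constrain those transitions.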

As a next step in this evolution, it would be great to have human taggers listen to audiobooks, tag the sentiment sentence by sentence, and then compare these tags with the machine-implemented SSML. How do transitions between emotions look for humans versus machines? A system that can automatically tag spoken speech with SSML would be another great tool.

Beyond this, another area is natural language generation based on tone. It's one thing to simply read text aloud, but when machines can generate text and express emotion in response to how they were spoken to, that's the recipe for killer apps.

However, looking back at the goal of helping with reading, it's one thing to help ESL learners mimic expression. True expression eventually comes from a genuine understanding and synthesis of the language.

Thanks for this article!


Leor Grebler

Independent daily thoughts on all things future, voice technologies and AI. More at http://linkedin.com/in/grebler