As long as Alexa or Google Assistant is the main conduit for conversations with bots, no individual bot will be able to hold a better conversation than these AI assistants. Why? All of the intelligence will be attributed to the main AI assistant's voice carrying the conversation. Other qualities of voice, such as rate of speech, prosody, emotion, and age, also can't be conveyed.
Right now, the only workaround for voice interaction with a more aware bot is for a Skill to exist as a contact that the person can call through Alexa Calling. Then, presumably, Alexa isn't listening in on the conversation, and all of the voice audio is transferred over SIP to the bot, which now has to process STT and NLU on its own (maybe through Lex). In exchange, it can also capture the other voice quality information.
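As a rough illustration of that bot-side processing step: Amazon Lex's runtime PostContent API accepts raw caller audio and returns a recognized intent and reply in one call, so a bot receiving the SIP audio could hand each utterance to Lex roughly like this. This is a minimal sketch, not a confirmed integration; the bot name, alias, and user id below are placeholders, and the SIP capture side is assumed to deliver 16 kHz mono PCM.

```python
# Sketch: bot-side STT + NLU for audio received over a SIP call leg,
# using the Amazon Lex (V1) runtime PostContent API via boto3.
# "MyPhoneBot" / "prod" are hypothetical names, not real resources.

LEX_CONTENT_TYPE = "audio/l16; rate=16000; channels=1"  # 16 kHz mono PCM


def handle_caller_audio(audio_bytes: bytes, user_id: str) -> dict:
    """Send one utterance of caller audio to Lex; return intent and reply.

    Lex performs speech recognition and intent matching in a single
    request, so the bot never needs Alexa's own STT/NLU pipeline.
    """
    import boto3  # imported lazily; requires AWS credentials at runtime

    lex = boto3.client("lex-runtime")  # Lex V1 runtime client
    response = lex.post_content(
        botName="MyPhoneBot",                 # placeholder bot name
        botAlias="prod",                      # placeholder alias
        userId=user_id,                       # one session per caller
        contentType=LEX_CONTENT_TYPE,         # format of inputStream
        accept="text/plain; charset=utf-8",   # ask for a text reply
        inputStream=audio_bytes,
    )
    return {
        "intent": response.get("intentName"),
        "reply": response.get("message"),
    }
```

Note that because the raw audio reaches the bot's own backend here, nothing stops it from also running its own analysis for prosody, emotion, or speaker traits before or alongside the Lex call.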
Perhaps there will exist a special category of Skills where the voice audio can also be sent to the Skill for additional processing. Another way the Skill could exude independence is by responding with its own custom TTS. My concern with this approach is that the added latency would lead to a poorer experience.
Maybe Amazon (and, long afterwards, Google) will be comfortable opening up audio capture and transmission to certain Skills/Actions and allowing custom TTS to be hosted natively on their servers? The benefit would be keeping the interaction on their platform, as opposed to driving a competitor to create its own hardware silo.