You notice very quickly how awkward it is to force someone not to use their hands while speaking. At the other extreme, people can over-gesticulate, making them look like Khrushchev banging his shoe on the UN podium (a fabricated story, by the way).
Yet when we think of voice interaction in isolation, we completely overlook how natural it is for us to use our hands to reinforce what we say with our words. Gestures can evoke emotion or change the meaning of words; combined with facial expressions, they convey the full meaning of our communication.
This is where camera-equipped voice-interactive devices have an advantage over microphone-only ones. Jibo and similar robots can combine voice interaction with camera-based emotion recognition to inform a natural language understanding (NLU) engine.
As yet, there doesn’t appear to be any publicly available NLU that combines language and emotion to build an appropriate understanding of intent (e.g. the STT result is “fine” but the detected emotion is negative, so the user’s intent is the equivalent of “no”). Maybe this is the next frontier for NLUs?
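To make the idea concrete, here is a minimal toy sketch of such a fusion step. Since no public NLU exposes this today, every name and rule below is a hypothetical illustration: it assumes an upstream STT engine has produced a transcript and an emotion classifier has produced a coarse label, and it simply flips nominally agreeable words to refusal when the emotion reads negative.

```python
# Hypothetical sketch: fusing an STT transcript with a detected emotion
# label to disambiguate user intent. All names and rules are illustrative,
# not taken from any real NLU product.

# Utterances whose literal meaning is agreeable but whose real intent
# depends heavily on tone and facial expression.
AMBIGUOUS_UTTERANCES = {"fine", "sure", "okay", "whatever"}

def resolve_intent(transcript: str, emotion: str) -> str:
    """Map an ambiguous utterance plus an emotion label to a yes/no intent."""
    text = transcript.strip().lower()
    if text in AMBIGUOUS_UTTERANCES:
        # A negative emotion flips a nominally agreeable word to refusal:
        # a curt "fine" with a frown really means "no".
        return "no" if emotion == "negative" else "yes"
    # Anything else would go to a full NLU pipeline instead.
    return "unresolved"

print(resolve_intent("Fine", "negative"))  # → no
print(resolve_intent("Fine", "positive"))  # → yes
```

A real system would of course replace the lookup table with the NLU’s intent model and feed the emotion signal in as a feature, but the core idea is the same: the words alone are not the message.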