The uncanny valley: does it happen with voices?

Android Repliee Q2 – is this in your Uncanny Valley? (Photo: Max Braun, CC BY-SA 2.0)

The uncanny valley can happen when robots look almost human but certain features are not quite right: perhaps the eyes are too big or look lifeless, or the face combines both human and artificial features, creating a nightmare version of Mr Potato Head. [1] The phenomenon has even been used to explain the failure of animated movies with creepy-looking characters, like The Polar Express. Does the uncanny valley exist for artificial voices? It is a timely question given the rise of synthetic speech in products such as smart assistants.

What is the uncanny valley?

The phrase was coined by the Japanese professor Masahiro Mori in the 1970s. Mori sketched out a graph like the one below, which showed how people’s affinity towards a robot varied with its closeness to the human form. Imagine starting with an industrial robot that is clearly mechanical and then gradually altering its features so it becomes more and more human (moving right on the graph). Mori predicted that at a certain point, just before the robot appears to be fully human, affinity would flip to revulsion. The graph therefore shows a sharp dip forming the uncanny valley. Note, the graph is a simplification of what can happen. For example, it is perfectly possible for a near-human robot to cause mirth rather than unease [2], but I’m going to focus on when people are unnerved.

Diagram by Smurrayinchester, based on image by Masahiro Mori & Karl MacDorman, CC BY-SA 3.0

The uncanny valley has been attributed to two effects that have been supported by experimental evidence [3]. One is the presence of atypical features. For example, you might have a realistic human head on a robot (see picture below). The other effect is category ambiguity, where it is difficult to determine whether a thing is human or robot (like the Android Repliee Q2 shown at the top of the page).

Albert Hubo robot. Photo by Dayofid at English Wikipedia, CC BY 2.5.

Uncanny synthetic speech

Do we get an uncanny valley with synthetic voices? It seems possible when there is an image of the talker as well as the voice. Then it is possible to get incongruity between the visual and audible modalities. The eeriness might be generated by the face movements and voice being a little out of sync or by a robot having a voice that is too human [4,5].

But what about a voice on its own? I’ve not found any evidence for this. It could be that the technology to create synthetic voices is not yet good enough for us to fall into the valley. But I’m unconvinced by that argument. There are plenty of synthesis samples where the talking is almost human with the occasional glitch, but that doesn’t seem to create a sense of revulsion. Maybe the atypicality has to be more obvious, where we take a piece of clearly synthetic speech and drop in the odd word voiced by a human. That would be a vocal equivalent of the Albert Hubo robot (an experiment for the future?). Maybe the lack of revulsion from the voice is down to something more obvious. The image of the Albert Hubo robot is disturbing because it looks like a real person has been decapitated and stuck on top of a machine. It is hard to think of a vocal equivalent of this (without images being involved).

What about the other mechanism for the uncanny valley: category ambiguity? My experience is that if I spot something that is not right with synthesised speech, the category just shifts from human to artificial. Another reaction is to assume that something has distorted the voice before it reached the ears; after all, we’re used to hearing voices mangled by mobile phones and Skype. Maybe the lack of feeling unnerved is because the ambiguity doesn’t lead to unpleasant associations. Android Repliee Q2 above looks like there is something wrong with her, as if she is ill or maybe not quite alive. Imperfect synthetic speech is never going to sound like someone who is very ill saying their last words on their death bed!

The more I think about this, the more I doubt that improved synthetic speech will lead to the uncanny valley. A myriad of techniques are used to modify voices for films, TV, games, radio, etc. We’ve all heard many examples of human voices that have been changed and augmented to make them sound less human. Characters such as monsters, aliens and robots are nearly all voiced by a human actor to begin with, and then lots of audio processing is applied. However, I can’t think of an example where the voice alone has had a creepy effect equivalent to the photo of Android Repliee Q2 above. Given the movie trope of malevolent forces wiping out the human race, I’m sure that if there were a way of exploiting a vocal uncanny valley, a sound designer would have found it and used it in a radio drama.

What do you think? Have you ever encountered the uncanny valley with a disembodied voice? Could such a thing exist? Please comment below.


[1] Kätsyri, J., Förger, K., Mäkäräinen, M. and Takala, T., 2015. A review of empirical evidence on different uncanny valley hypotheses: support for perceptual mismatch as one road to the valley of eeriness. Frontiers in Psychology, 6, p.390.

[2] Mäkäräinen, M., Kätsyri, J., Förger, K. and Takala, T., 2015, September. The funcanny valley: A study of positive emotional reactions to strangeness. In Proceedings of the 19th International Academic Mindtrek Conference (pp. 175-181). ACM.

[3] Strait, M.K., Floerke, V.A., Ju, W., Maddox, K., Remedios, J.D., Jung, M.F. and Urry, H.L., 2017. Understanding the uncanny: both atypical features and category ambiguity provoke aversion toward humanlike robots. Frontiers in Psychology, 8, p.1366.

[4] Tinwell, A., Grimshaw, M. and Nabi, D.A., 2015. The effect of onset asynchrony in audio-visual speech and the Uncanny Valley in virtual characters. International Journal of Mechanisms and Robotic Systems, 2(2), pp. 97–110.

[5] Mitchell, W.J., Szerszen Sr, K.A., Lu, A.S., Schermerhorn, P.W., Scheutz, M. and MacDorman, K.F., 2011. A mismatch in the human realism of face and voice produces an uncanny valley. i-Perception, 2(1), pp. 10–12.