The uncanny valley can happen when robots look almost human but certain features are not quite right: perhaps the eyes are too big or look lifeless, or the face combines both human and artificial features, creating a nightmare version of Mr Potato Head. The phenomenon has even been used to explain the failure of animated movies with creepy-looking characters, like The Polar Express. Does the uncanny valley exist for artificial voices? It is a timely question given the rise of synthetic speech in products such as smart assistants.
What is the uncanny valley?
The phrase was coined by the Japanese professor Masahiro Mori in the 1970s. Mori sketched out a graph like the one below, which showed how people’s affinity towards a robot varied with its closeness to the human form. Imagine starting with an industrial robot that is clearly mechanical and then gradually altering its features so it becomes more and more human (moving right on the graph). Mori predicted that at a certain point, just before the robot appears to be fully human, affinity would flip to revulsion. The graph therefore shows a sharp dip forming the uncanny valley. Note that the graph is a simplification of what can happen. For example, it is perfectly possible for a near-human robot to cause mirth rather than unease, but I’m going to focus on when people are unnerved.
The uncanny valley has been attributed to two effects that have been supported by experimental evidence. One is the presence of atypical features. For example, you might have a realistic human head on a robot (see picture below). The other effect is category ambiguity, where it is difficult to determine whether a thing is human or robot (like the Android Repliee Q2 shown at the top of the page).
Uncanny synthetic speech
Do we get an uncanny valley with synthetic voices? It seems possible when there is an image of the talker as well as the voice. Then it is possible to get incongruity between the visual and audible modalities. The eeriness might be generated by the face movements and voice being a little out of sync or by a robot having a voice that is too human [4,5].
But what about a voice on its own? I’ve not found any evidence for this. It could be that the technology to create synthetic voices is not yet good enough for us to fall into the valley. But I’m unconvinced by that argument. There are plenty of synthesis samples where the talking is almost human with the occasional glitch, but that doesn’t seem to create a sense of revulsion. Maybe the atypicality has to be more obvious, where we take a piece of clearly synthetic speech and drop in the odd word voiced by a human. That would be a vocal equivalent of the Albert Hubo robot (an experiment for the future?). Maybe the lack of revulsion from the voice is down to something more obvious. The image of the Albert Hubo robot is disturbing because it looks like a real person has been decapitated and stuck on top of a machine. It is hard to think of a vocal equivalent of this (without images being involved).
What about the other mechanism for the uncanny valley: category ambiguity? My experience is that if I spot something that is not right with synthesised speech, the category just shifts from human to artificial. Another reaction is to assume that something has distorted the voice before it reached the ears; after all, we’re used to hearing voices mangled by mobile phones and Skype. Maybe the lack of feeling unnerved is because ambiguity doesn’t lead to unpleasant associations. Android Repliee Q2 above looks like there is something wrong and she is ill or maybe not quite alive. Imperfect synthetic speech is never going to sound like someone who is very ill saying their last words on their death bed!
The more I think about this, the more I doubt that improved synthetic speech will lead to the uncanny valley. There are a myriad of techniques that are used to modify voices for films, TV, games, radio, etc. We’ve all heard many examples of human voices that have been changed and augmented to make them sound less human. Characters such as monsters, aliens and robots nearly all get voiced by a human actor to begin with and then lots of audio processing is applied. However, I can’t think of an example where the voice alone has had a creepy effect equivalent to the photo of Android Repliee Q2 above. Given the movie trope of malevolent forces wiping out the human race, I’m sure that if there was a way of exploiting a vocal uncanny valley, a sound designer would have found a way of doing this in a radio drama.
What do you think? Have you ever encountered the uncanny valley with a disembodied voice? Could such a thing exist? Please comment below.
[1] Kätsyri, J., Förger, K., Mäkäräinen, M. and Takala, T., 2015. A review of empirical evidence on different uncanny valley hypotheses: support for perceptual mismatch as one road to the valley of eeriness. Frontiers in Psychology, 6, p.390.
[2] Mäkäräinen, M., Kätsyri, J., Förger, K. and Takala, T., 2015, September. The funcanny valley: A study of positive emotional reactions to strangeness. In Proceedings of the 19th International Academic Mindtrek Conference (pp. 175-181). ACM.
[3] Strait, M.K., Floerke, V.A., Ju, W., Maddox, K., Remedios, J.D., Jung, M.F. and Urry, H.L., 2017. Understanding the uncanny: both atypical features and category ambiguity provoke aversion toward humanlike robots. Frontiers in Psychology, 8, p.1366.
[4] Tinwell, A., Grimshaw, M. and Nabi, D.A., 2015. The effect of onset asynchrony in audio-visual speech and the Uncanny Valley in virtual characters. International Journal of Mechanisms and Robotic Systems, 2(2), pp. 97-110.
[5] Mitchell, W.J., Szerszen Sr, K.A., Lu, A.S., Schermerhorn, P.W., Scheutz, M. and MacDorman, K.F., 2011. A mismatch in the human realism of face and voice produces an uncanny valley. i-Perception, 2(1), pp. 10-12.
I feel like the uncanny valley effect may still be present with voices. I find paranormal investigation documentaries fascinating (real or not, I prefer to suspend my disbelief for the sake of entertainment). I have, however, found that the “EVP” voices that are presented in many of these are often extremely unsettling. I can’t quite pinpoint whether it’s because I choose to let them seem scary or if they genuinely are.
What about an edited or synthesised version of a person saying something they would clearly never say? What if it were a dead famous person speaking as if they were still alive? What if they addressed you directly? All of these are easily done by cutting phrases together, but the results rarely sound convincing due to jumps in the tone and recording. If those jumps were smoothed out, would you approach the uncanny valley?
Re Samantha’s comment on EVP, surely there are many other factors at play in making that creepy given the context?
I remember watching “The Phantom Menace” and finding Natalie Portman’s voice as Padme very strange; a little robotic, but I didn’t think much of it (the movie was so bad that this quirk just fit in with everything else that was plodding and wooden about it).
Well, apparently her voice was digitally lowered to sound more like a Queen rather than a young girl. Is this some version of the uncanny valley? Not sure. I know Portman’s real voice, so that made my ears perk up, but it was also slightly mechanical sounding. Is that what made me think she sounded wooden? Again, not sure, but it was very strange to my ears.
One suggestion someone made to me was that we’ve dabbled in fake human voices for a long time (electronic means date back at least 80 years, e.g. the Voder), so maybe we’ve just got used to weird voices already. And even before then, people could do amazingly strange things with their voices; we can acoustically distort a voice much more than we can distort a face.
I was going to post a link to a reddit discussion of her voice in the movie but can’t here so just search for that there.
Happy to read this article.
I just had a feeling similar to the one I get when watching some animated movies, while I was hearing someone singing with a voice effect pedal, creating harmonies. These harmonies were close to his real voice but still kind of robotic. It gave me chills, and not the good kind.
I immediately wondered whether there were studies about the uncanny valley and voices.
Fascinating thoughts. The original UV thesis is visual and visceral. Its applicability to sound seems to vary tremendously, from scenario to scenario and person to person. I work on digital conversations in voice and text, and it occurred to me reading your post that perhaps the effect in the soundscape is present but insignificant in the short term, and cumulatively significant over the longer term. Perhaps there’s a crossover with evolutionary social psychology here too, which could augment the feeling. Could the audio version of the uncanny valley be connected less to a gut reaction and more to a hardly traceable – possibly evolutionary – anthropomorphic tendency? Your thoughts on the subject will certainly help us as we embark on our research to develop more relatable interactions with digital avatars.