Earlier this summer, at the re:MARS conference—an Amazon-hosted event focusing on machine learning, automation, robotics, and space—Rohit Prasad, head scientist and vice president of Alexa A.I., aimed to wow the audience with a paranormal parlor trick: speaking with the dead. “While A.I. can’t eliminate that pain of loss, it can definitely make their memories last,” he said, before showing a short video that starts with an adorable boy asking Alexa, “Can Grandma finish reading me The Wizard of Oz?”
The woman’s voice that reads a few sentences from the book sounds grandmother-y enough. But without knowing Grandma, it was impossible to evaluate the likeness. And the whole thing struck many observers as more than a little creepy—Ars Technica called the demo “morbid.”
But Prasad’s revelation of how the “trick” was performed was truly gasp-worthy: Amazon scientists were able to summon Grandma’s voice based on just a one-minute audio sample. And they can easily do the same with pretty much any voice, a prospect that you may find exciting, terrifying, or a combination of both.
The fear of “deepfake” voices capable of fooling humans or voice-recognition technology is not unfounded—in one 2020 case, thieves used an artificially generated voice to talk a Hong Kong bank manager into releasing $400,000 in funds before the ruse was discovered.
At the same time, as voice interactions with technology become more common, brands are eager to be represented by unique voices. And consumers seem to want tech that sounds more human (although a Google voice assistant that imitated the “ums,” “mm-hmms,” and other tics of human speech was criticized for being too realistic).
That’s been driving a wave of innovation and investment in A.I.-powered text-to-speech (TTS) technology. A search on Google Scholar shows more than 20,000 research articles on text-to-speech synthesis published since 2021. Globally, the text-to-speech market is projected to reach $7 billion in 2028, up from about $2.3 billion in 2020, according to Emergen Research.
Today, the most widespread use of TTS is in digital assistants and chatbots. But emerging voice-identity applications in gaming, media, and personal communication are easy to imagine: custom voices for your virtual personas, text messages read out in your voice, voiceovers by absent (or deceased) actors. The metaverse is also changing the way we interact with technology.