Artificial intelligence is used to digitally reproduce human voices : NPR
The science behind making machines talk like humans is very complex, because our speech patterns are so nuanced.
“The voice is not easy to grasp,” says Klaus Scherer, professor emeritus of the psychology of emotion at the University of Geneva. “Analyzing the voice really requires a lot of knowledge about acoustics, vocal mechanisms and physiological aspects. So it’s necessarily interdisciplinary, and quite demanding in terms of what you have to master to do anything important.”
It’s no surprise, then, that it took more than 200 years for synthetic voices to evolve from the first talking machine, invented by Wolfgang von Kempelen around 1800 – a box-like contraption that used bellows, pipes and a rubber mouth and nose to simulate a few recognizable human utterances, like “mom” and “dad” – to a voice clone of Samuel L. Jackson delivering the weather report on Alexa today.
Talking machines like Siri, Google Assistant and Alexa, or a bank’s automated customer service line, now sound quite human. Thanks to advances in artificial intelligence, or AI, we have reached a point where it is sometimes difficult to distinguish synthetic voices from real ones.
I wanted to know what the process involves from the client’s side. So I contacted Speech Morphing, a natural language text-to-speech company based in the San Francisco Bay Area, and asked it to create a clone – or “digital double” – of my own voice.
Journalist gets her voice cloned
Given the complexity of text-to-speech, it’s a shock to discover how easy it is to order a voice clone. For a basic conversational build, all a client has to do is record themselves saying a set of scripted lines for about an hour. And that’s about it.
“We extract 10 to 15 minutes of clean recordings for a basic version,” explains Fathy Yassa, founder and CEO of Speech Morphing.
The hundreds of sentences I record for Speech Morphing to build my digital vocal double seem very random: “Here the explosion of cheerfulness drowned it out.” “That’s what Carnegie did.” “I would love to be buried under Yankee Stadium with JFK.” And so on.
But they are not as random as they seem. Yassa says the company chooses utterances that will produce a fairly wide variety of sounds across a range of emotions – such as apology, enthusiasm, anger, etc. – to feed a training system based on a neural network. The system essentially teaches itself the specific patterns of a person’s speech.
Yassa says there are around 20 effects or tones to choose from, and some of them can be used interchangeably, or not at all. “Not every tone or effect is needed for every client,” he says. “The choice depends on the target application and use cases. Banking is different from e-books, is different from reporting and broadcasting, is different from consumer.”
At the end of the recording session, I send the audio files to Speech Morphing. From there, the company breaks down and analyzes my utterances, then builds the model from which the AI can learn. Yassa says the whole process takes less than a week.
He says the possibilities for the voice clone of Chloe Veltman – or “Chloney,” as I affectionately call my robot self – are nearly limitless.
“We can make you apologize, we can make you promote, we can make you act like you’re in the theater,” Yassa says. “We can make you sing, eventually, although we’re not there yet.”
A fast-growing industry
The global speech and voice recognition industry is worth tens of billions of dollars and growing rapidly. Its uses are obvious. The technology has given actor Val Kilmer, who lost his voice to throat cancer a few years ago, the chance to reclaim something close to his old vocal powers.
It has allowed film directors, audiobook creators and game designers to develop characters without needing live voice talent, as in the film Roadrunner, where an AI was trained on Anthony Bourdain’s vast archive of media appearances to create a digital duplicate of the late chef and TV personality’s voice.
As perfect as Bourdain’s digital vocal double might be, it also sparked controversy. Some people have raised ethical concerns about putting words into Bourdain’s mouth that he never said in his lifetime.
A cloned version of Barack Obama’s voice warning people of the dangers of fake news, created by actor and director Jordan Peele, drives home the point: sometimes we have reason to be wary of machines that sound too much like us.
“We’re entering an era where our enemies can make it look like anyone is saying anything at any time,” the Obama deepfake says in the video, produced in conjunction with BuzzFeed in 2018. “Even if they would never say those things.”
When too human is too much
Sometimes, however, we don’t necessarily want machines to sound too human, because that scares us.
If you’re looking for a digital voice double to read an audiobook to children, or act as a companion or helper for an elderly person, a more human voice might be the right way to go.
“Maybe not something that really breathes, because it’s a little scary, but a little more human might be more approachable,” says user experience and voice designer Amy Jiménez Márquez, who led Amazon Alexa’s voice, multimodal and UX personality experience design team for four years.
But for a machine that performs basic tasks, like, say, a voice-activated refrigerator? Maybe less human is best. “Having something a bit more robotic and you can even create a little voice that sounds like a real cute robot would be more appropriate for a fridge,” says Jiménez Márquez.
The big reveal
During a demo session with Speech Morphing, I hear Chloney, my digital voice double.
Her voice comes to me through a pair of portable speakers connected to a laptop computer. The laptop brings up the programming interface into which the text I want Chloney to speak is typed. The interface includes tools for making micro-adjustments to pitch, speed and other vocal attributes that might need to be tweaked if Chloney’s prosody doesn’t sound exactly right.
“Happy birthday. Happy birthday. Happy birthday, dear Chloney. Happy birthday,” Chloney says.
Chloney can’t sing “Happy Birthday” – at least for now. But she can read stories that I haven’t even reported myself, like one from the AP newswire about the COVID-19 pandemic. And she can even do it in Spanish.
Chloney sounds a lot like me. It’s impressive, but it’s also a bit scary.
“My jaw is on the floor,” says the original voice behind Chloney – that’s me, Chloe – as I listen to what my digital voice double can do. “Let’s hope she doesn’t put me out of work anytime soon.”