An Irish startup has claimed a breakthrough in text-to-speech synthesis that improves on public demonstrations by Google’s DeepMind and Facebook.
The result is an artificial voice that lacks many of the glitches in intonation heard from digital assistants like Siri or Amazon’s Alexa. It sounds eerily human and shows that you no longer need a multi-billion-dollar R&D budget or hundreds of engineers to produce an artificial voice that’s as good as Google’s.
Voysis has shared its audio sample exclusively with Forbes, an automated reading of Anna Sewell’s novel “Black Beauty,” and you can listen to it here:
Voysis founder Peter Cahill insists that the sample above was not pre-recorded by a human but was produced by an algorithm trained on a popular dataset for building text-to-speech software. The demo is also significantly longer than comparable ones released by Google, Facebook and Baidu, which you can listen to further down this story.
The technology’s secret sauce is nothing new. In fact, it’s not even unique to Voysis. It’s a method called wavenet, developed by researchers at Google’s DeepMind and published as a research paper in September 2016. The method uses a deep neural network that generates raw audio one sample at a time, and is said to represent a significant leap forward in artificial-voice technology. It also raises difficult questions about how close to “human” we want our artificial voices to sound.
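DeepMind’s paper describes wavenet as an autoregressive model: each new audio sample is predicted from all the samples generated before it. Here is a toy sketch of that sampling loop in Python, with a random stand-in where the real neural network would go (all names and numbers here are illustrative, not DeepMind’s code):

```python
import math
import random

def toy_next_sample_distribution(history):
    # Stand-in for wavenet's neural network: returns a probability
    # distribution over 256 possible 8-bit sample values. A real
    # wavenet conditions this on the text being spoken as well.
    rng = random.Random(len(history))
    logits = [rng.gauss(0, 1) for _ in range(256)]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(n_samples, seed=0):
    # Autoregressive sampling: draw one audio sample at a time,
    # feeding everything generated so far back into the model.
    rng = random.Random(seed)
    audio = []
    for _ in range(n_samples):
        probs = toy_next_sample_distribution(audio)
        audio.append(rng.choices(range(256), weights=probs, k=1)[0])
    return audio

clip = generate(100)  # 100 samples; real speech needs 16,000+ per second
```

The sample-by-sample loop is also why wavenet was originally so slow: every sample requires a full pass through the network before the next one can be drawn.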
The development comes at a time when digital assistants are becoming more popular because they exist not only on smartphones but on smart speakers in the home, an environment where consumers feel more comfortable speaking out loud to devices. Recent stats from Apple suggest Siri has more than 41 million monthly active users; in other words, more than 1 in 10 Americans talk to Siri at least once per month.
But wavenet hasn’t received much public attention because most consumers haven’t been able to experience it yet, and if they have, it has shown up as an automatic update with little fanfare.
Google only this month started updating the artificial voice on its Google Assistant smartphone app and Google Home speakers in the U.S. and in Japan to use wavenet, a Google spokesperson confirmed to Forbes.
Here’s a demo of an artificial voice from Google using the wavenet method, released in October:
The method represents a 50% improvement in artificial voice generation, according to Google’s own research, after years of tiny, incremental improvements to popular assistants like Siri and Alexa. It means Google Home and other products that use wavenet should very quickly get better at pronouncing people’s names and locations, and sound less glitchy overall.
“Previous developments were improved by 1% a year,” says Cahill.
Wavenet, he says, is the biggest breakthrough in artificial voice generation in more than two decades. “The new generation of speech technologies are going to emerge on the back of this.”
Who else is using wavenet to improve their own voice technology? Cahill says the answer is any company that wants to be able to interact with consumers through voice, which ticks off all the big names like Apple, Amazon, Facebook and Baidu.
Some have been open about tinkering with wavenet and released demos of their latest work.
Here’s the most recent wavenet demo from researchers at Baidu, released in February 2017:
And from a research team at Facebook, released in July 2017:
Neither Apple nor Amazon has released any demos of its work on wavenet, but both companies are almost certainly working on the new technique.
Apple said in a blog post in August that it had used deep learning to make a significant upgrade to Siri’s voice, making it sound less robotic in iOS 11. You can compare it with the Siri of iOS 9 here, and you’ll notice that the Siri of today already sounds much more natural.
Apple also dismissed wavenet in that same blog post, saying it wasn’t feasible to use it in services like Siri yet because of its “extremely high computational cost.”
Google got around that problem only recently. When it first released wavenet last year, the system took about two minutes to generate a two-second audio clip. Since then, Google’s implementation of wavenet has been upgraded to generate sound 20 times faster than real time: it can produce two seconds of audio in 0.1 seconds.
Surveys show artificial voices made with wavenet sound closer to human than anything yet invented. A common method for measuring how natural speech sounds is to have listeners score it on a scale of 1 to 5. Natural speech typically scores around 4.5 (factoring in the suspicion listeners feel when asked to judge the naturalness of speech). Traditional artificial voices, like the ones you hear with train and bus announcements or from digital assistants, score around 3.8. Google’s wavenet audio scored 4.2.
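That 1-to-5 scale is the mean opinion score (MOS): a panel of listeners rates each clip and the ratings are averaged. A minimal illustration, with an invented panel of ratings:

```python
def mean_opinion_score(ratings):
    # MOS: the arithmetic mean of listeners' 1-to-5 naturalness ratings.
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-to-5 scale")
    return sum(ratings) / len(ratings)

# Hypothetical panel of ten listeners rating one synthesized clip:
panel = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
print(mean_opinion_score(panel))  # 4.2
```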
The most popular method for creating an artificial voice until now has been the so-called concatenative method, which involves recording a huge amount of voice data and slicing it into small units. Software then figures out the best parts to stitch back together. But these voices can sound glitchy and have unnatural jumps in pitch, whereas wavenet audio sounds much smoother.
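A heavily simplified sketch of that stitching step: for each unit of the target utterance, pick the closest-matching recorded unit from the database. (Everything here is a toy stand-in; real unit-selection systems store actual audio frames and also minimize a "join cost" between consecutive units, and poor joins are exactly where the audible glitches come from.)

```python
# Toy unit-selection database: each "unit" is a (label, pitch_hz) pair
# standing in for a recorded snippet of speech.
DATABASE = {
    "hel": [("hel", 180), ("hel", 210)],
    "lo":  [("lo", 175), ("lo", 220)],
}

def select_units(targets):
    # For each target (label, desired_pitch), choose the recorded unit
    # with that label whose pitch is closest to the desired pitch.
    chosen = []
    for label, desired_pitch in targets:
        candidates = DATABASE[label]
        best = min(candidates, key=lambda u: abs(u[1] - desired_pitch))
        chosen.append(best)
    return chosen

units = select_units([("hel", 185), ("lo", 215)])
print(units)  # [('hel', 180), ('lo', 220)]
```

Note the pitch jump from 180 Hz to 220 Hz across the join — the kind of discontinuity that makes concatenative voices sound glitchy.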
Wavenet is still relatively new, and according to Cahill, senior voice engineers at some of Google’s competitors initially believed Google’s first public demo of the method was a PR stunt. “It’s not a PR stunt, though,” he says.
Cahill suggests companies like Apple might find it difficult to suddenly shift their attention to another technique for political and financial reasons; their executives would have to make an uncomfortable strategic shift in budgets and direction.
They spent "tens of millions of dollars catching up to Google," says Cahill. "Then realize Google had this other thing, and are a generation ahead."
Not only that, but Google's DeepMind has also given away the blueprints for wavenet to anyone who wants them — in the form of a research paper released last year — meaning everyone in the industry can start on the same page.
So how did Voysis with its team of 10 engineers come up with a better wavenet demo than the likes of Facebook and Baidu? (Bear in mind that Baidu spent more than $400 million on R&D, mostly going towards artificial intelligence, in the first quarter of 2017. The first quarter.)
Cahill says he’s carefully picked the right talent and built up a network of contacts in voice technology since 2002, chairing a speech-synthesis special interest group and helping to organize conferences. One of his senior engineers, Ian Hodson, formerly led Google DeepMind’s text-to-speech work in London, in the same office and with the same team that developed wavenet itself.
Anthony Tomlinson, a speech scientist who worked on wavenet at DeepMind before leaving earlier this year, says the latest demo from Voysis is “essentially glitch-free.” Tomlinson doesn’t work with Voysis but verified its improvements for Forbes. “I’m impressed with it,” he says.
Cahill recalls the moment in September 2017 when he and his engineering team listened to their latest wavenet demo on laptop speakers in a meeting room at their Dublin, Ireland office. “Everyone realized it was a lot better,” he says. “It was a room full of smiles.”
Wavenet acts like a “renderer” that gives sound a higher resolution, just as a 4K screen would make movies appear more real, says Tomlinson. “That’s what wavenet does for speech synthesis.”
Over time, wavenet could also make it possible for software to manipulate existing voices into saying new things that sound close to natural, without anyone having to spend hours in a booth recording thousands of units of speech.
Some companies claim to have accomplished this already. In November 2016 Adobe teased a new application called Project Voco which it called “Photoshop for speech.” In a demo, the company edited a recording of a man saying “I kissed my dogs and my wife” and manipulated it to say “I kissed Jordan three times.” Adobe has yet to release Voco to consumers.
Then in April this year startup Lyrebird said it could recreate any voice using just one minute of sample audio. In its demo, the startup synthesized the voices of Donald Trump and Barack Obama talking about the company. But Adobe has probably used the traditional, “concatenative” method for voice synthesis, says Cahill. “I would expect it to be very inconsistent.” He expects the wavenet method from Google to lead to much higher-quality “mimics” of people’s voices.
How much do we want computers to sound like humans? Cahill recalls Heiga Zen, one of the authors of DeepMind’s landmark wavenet research paper, saying at a speech-synthesis workshop a year ago that within three years, people wouldn’t be able to tell whether they were listening to a machine or a human.
“No one in the audience disagreed with him,” Cahill says. “Nobody thought it was crazy.”
Call centers could outsource more of their human work if they could use artificial voices that sound more natural. Video games could include an infinite array of natural-sounding dialogue, and advertisers wouldn’t have to keep bringing voice actors or celebrities into the studio. “You’ll just have a license agreement to mimic someone’s voice,” says Cahill.
This works because wavenet isn’t fundamentally about speech but about the ability to create audio. Google, for instance, was able to train the neural network behind wavenet on classical piano performances too. The result was something you could easily imagine hearing on stage at Carnegie Hall:
There’s of course a darker side to all this, particularly around biometric security and internet banking systems that use your voice to authenticate your identity.
Cahill doubts that voice biometrics systems will be able to stand up to wavenet as more technology companies use the technique.
“It won’t just sound human to people,” he warns, “but to machines too.”