DeepMind Uses Deep Neural Networks To Improve Text-to-Speech... and More

Monday, September 12, 2016




Today's artificial speech tends to sound robotic, but with a new system called WaveNet, Google DeepMind has produced speech that sounds much more naturally human. While not perfect, it closes the gap with human performance by over 50% compared with the best existing technologies. And since WaveNet is at its core a general audio model, it can also create music.


"WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%."
Google DeepMind has developed a new artificial intelligence-based voice synthesis system that sounds much more human than today's standard text-to-speech (TTS) engines.

DeepMind's system, called WaveNet, uses a deep generative model of raw audio waveforms. "We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%," the researchers claim.
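To make "a deep generative model of raw audio waveforms" concrete: WaveNet-style models typically treat audio as a sequence of discrete values, compressing each 16-bit sample to one of 256 levels with mu-law companding so that predicting the next sample becomes a classification problem. Here is a minimal sketch of that quantization step; the function names are ours, not DeepMind's:

```python
import numpy as np

MU = 255  # 256 quantization levels, as in the WaveNet paper

def mu_law_encode(x):
    """Compress samples in [-1, 1] to integer levels in [0, 255].

    Mu-law companding allocates more levels to quiet sounds, which is
    what lets a 256-way softmax stand in for raw 16-bit audio."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((compressed + 1) / 2 * MU).astype(np.int64)

def mu_law_decode(levels):
    """Invert mu_law_encode, mapping levels back to samples in [-1, 1]."""
    compressed = 2 * (levels / MU) - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(MU)) / MU

wave = np.sin(np.linspace(0, 2 * np.pi, 16000))  # one second of a test tone
roundtrip_error = np.max(np.abs(mu_law_decode(mu_law_encode(wave)) - wave))
assert roundtrip_error < 0.05  # quantization loses little perceptible detail
```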

Non-speech sounds, such as breathing and mouth movements, are also sometimes generated by WaveNet, and they add a very natural quality to the output. Consider how effective such sounds are in our everyday interactions, or think about how Samantha conveyed so much emotion through these intonations in the movie Her.

The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks. But generating speech with computers is still largely based on so-called concatenative TTS, where a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example, switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.
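For contrast, here is a toy illustration of the concatenative idea (ours, not an actual engine): pre-recorded fragments from one speaker are looked up and stitched together, which is why changing the speaker or the emotional delivery means recording an entirely new database. Real systems select among diphone-sized units and smooth the joins; this sketch uses whole words and random arrays as stand-ins for recordings:

```python
import numpy as np

# Hypothetical fragment database: one speaker, one emotional delivery.
# Real concatenative systems store thousands of sub-word units (diphones),
# not whole words, and smooth the boundaries between them.
fragment_db = {
    "hello": np.random.randn(8000),  # stand-in for a 0.5 s recording at 16 kHz
    "world": np.random.randn(8000),
}

def synthesize(text):
    """Recombine stored fragments into an utterance by simple lookup."""
    pieces = [fragment_db[word] for word in text.lower().split()
              if word in fragment_db]
    return np.concatenate(pieces) if pieces else np.zeros(0)

audio = synthesize("hello world")
# A different voice or emphasis is not a parameter change here: every
# fragment in fragment_db would have to be re-recorded.
```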



The following figure shows the quality of WaveNets compared with Google's current best TTS systems, which use either parametric or concatenative algorithms, and with human speech. The data was obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). The results show that WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese.

[Figure: subjective quality ratings for parametric TTS, concatenative TTS, WaveNet, and human speech, in US English and Mandarin Chinese.]

For both Chinese and English, Google’s current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement.

It turns out that WaveNet can be used for more than just voice generation. "We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces," the researchers write.

Because WaveNets can be used to model any audio signal, the researchers thought it would also be fun to try to generate music. Unlike the TTS experiments, they didn't condition the network on an input sequence telling it what to play (such as a musical score); instead, they simply let it generate whatever it wanted to.

So, can we expect WaveNet in Google apps anytime soon? Probably not. WaveNet has to build the entire waveform one sample at a time, running its neural network to generate 16,000 samples for every second of audio it produces; and even that sampling rate falls short of high-definition audio.
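To see why inference is slow, note that WaveNet is autoregressive: each new sample is conditioned on the samples before it, so generation cannot be parallelized across time. A rough sketch of that loop, with a uniform placeholder standing in for the trained network:

```python
import numpy as np

SAMPLE_RATE = 16000  # WaveNet's 16 kHz output; CD-quality audio is 44.1 kHz

def next_sample_distribution(history):
    """Placeholder for the trained network: return a probability
    distribution over the 256 possible values of the next sample.
    A real WaveNet computes this with a deep stack of dilated causal
    convolutions over the preceding samples."""
    return np.full(256, 1.0 / 256)

def generate(seconds):
    rng = np.random.default_rng(0)
    samples = []
    # Strictly sequential: one full network evaluation per sample,
    # 16,000 evaluations for every second of audio produced.
    for _ in range(int(seconds * SAMPLE_RATE)):
        probs = next_sample_distribution(samples)
        samples.append(rng.choice(256, p=probs))
    return np.array(samples)

clip = generate(0.01)  # even 10 ms of audio takes 160 sequential network calls
```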

According to a DeepMind source who spoke to the Financial Times, that means we will have to wait a while before WaveNet is used extensively in any of Google's products. But as we know, exponential technologies like this have a habit of catching up to, and beating, our expectations in short order.





SOURCE: DeepMind


By 33rd Square


