A Neural Network Learns to Generate Voice

Monday, August 22, 2016



Artificial Intelligence

A researcher has looked at what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it's learned. The results are very intriguing.


By now, our readers are probably familiar with the basics of neural networks and the interesting possibilities of the technology. For example, with image processing, deep neural networks can produce incredible hallucinogenic results, or merge the style of one image into another. Now, one researcher has examined what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it's learned.

The example below is a recurrent neural network (an LSTM) with 3 layers of 680 neurons each, trying to find patterns in the audio and reproduce them as well as it can. The author notes that it's not a particularly big network considering the complexity and size of the data, mostly due to computing constraints, which makes me all the more impressed with what it managed to do.
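The researcher's setup was built on torch-rnn (Lua/Torch), so the PyTorch snippet below is only an illustrative sketch of the architecture as described: a character-level model in which each 8-bit audio sample is treated as one of 256 "characters" and fed through a 3-layer LSTM with 680 units per layer. The embedding size and other details are assumptions, not his actual configuration.

```python
import torch.nn as nn

class ByteLSTM(nn.Module):
    """Character-level LSTM over raw 8-bit audio samples (256 possible values)."""
    def __init__(self, vocab_size=256, hidden_size=680, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # assumed embedding size
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True)                # 3 layers x 680 units
        self.out = nn.Linear(hidden_size, vocab_size)        # logits for the next sample

    def forward(self, x, state=None):
        # x: (batch, time) tensor of integer sample values in [0, 255]
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = ByteLSTM()
print(sum(p.numel() for p in model.parameters()))  # roughly 11 million weights at these sizes
```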

The audio the network learned from is voice actress Kanematsu Yuka voicing Hinata from Pure Pure. The researcher used 11,025 Hz, 8-bit audio because sound files get big quickly compared to text files: 10 minutes already comes to 6.29 MB, while a human would need weeks or months to read that much plain text.
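At one byte per sample, the file-size arithmetic is easy to check against the figure quoted above:

```python
# 10 minutes of 11,025 Hz, 8-bit mono audio at one byte per sample
size_bytes = 11_025 * 60 * 10      # 6,615,000 bytes
print(size_bytes / 2**20)          # ~6.3 MiB, in line with the 6.29 MB quoted above
```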

Because torch-rnn, the program he used, is actually designed to learn from and generate plain text, the researcher wrote a converter that turns any data into UTF-8 text and back again. To his excitement, torch-rnn happily processed that text as if there were nothing unusual about it.
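The article doesn't show the researcher's converter, but the trick can be sketched: map every raw byte to a single Unicode code point so the result is valid UTF-8 text, then invert the mapping on whatever the network generates. A minimal Python illustration (the file names and the Latin-1 mapping are assumptions, not his actual scheme):

```python
def bytes_to_text(raw: bytes) -> str:
    # Latin-1 maps every byte value 0-255 to exactly one code point,
    # so the decode never fails and round-trips losslessly.
    return raw.decode("latin-1")

def text_to_bytes(text: str) -> bytes:
    # Inverse mapping back to raw 8-bit samples.
    return text.encode("latin-1")

# Hypothetical file names: raw 8-bit PCM in, UTF-8 "text" out for torch-rnn.
with open("voice.raw", "rb") as f:
    text = bytes_to_text(f.read())
with open("voice.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

One side effect of any mapping like this is that byte values above 127 occupy two bytes once written as UTF-8, so the text file ends up somewhat larger than the raw audio it encodes.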

"I did this because I don't know where to begin coding my own neural network program, but this workaround has some annoying restraints. E.g. torch-rnn doesn't like to output more than about 300KB of data, hence all generated sounds being only ~27 seconds long," he writes.

It took roughly 29 hours to train the network to ~35 epochs (74,000 iterations) and over 12 hours to generate the samples (the output audio). These times are quite approximate, as the same server was training and sampling (from earlier network checkpoints) at the same time, which slowed both tasks down.
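To turn a sampled checkpoint's text output back into something playable, the bytes just need a WAV header around them. A minimal sketch using Python's standard wave module, assuming the 11,025 Hz, 8-bit mono format described earlier (the file names are illustrative, not the researcher's):

```python
import wave

def bytes_to_wav(samples: bytes, path: str, rate: int = 11_025) -> None:
    # 8-bit PCM in a WAV container is unsigned, one byte per sample; mono here.
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)
        w.setframerate(rate)
        w.writeframes(samples)

# Hypothetical generated-text file from the network, mapped back to raw bytes.
with open("generated.txt", encoding="utf-8") as f:
    bytes_to_wav(f.read().encode("latin-1"), "generated.wav")
```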

The next step will be to try a bigger network.




SOURCE  Something Unreal


By 33rd Square

