Artificial Intelligence Can Create Sound Tracks for Silent Videos

Researchers Ghose and Prevost created a deep learning algorithm that, given a silent video, can generate a realistic-sounding, synchronized soundtrack.

Movies frequently have sound effects added that were not recorded during filming, in order to make scenes feel more realistic. This process is called "Foley". Researchers at the University of Texas turned to deep learning to automate it. They trained a neural network on 12 popular movie events for which directors frequently add Foley effects. Their system has one network that predicts the class of the sound to generate, and a sequential network that generates the sound itself. They thus used neural networks to go from temporally aligned images to generated sound, a completely different modality.

The first thing the researchers did was create a dataset (the Automatic Foley Dataset) containing short movie clips covering 12 movie events. For some events (such as cutting, footsteps, and a clock sound) they recorded the sounds themselves in a studio. For other events (such as gunshots, a horse running, and fire) they downloaded video clips with sound from YouTube. In total, they collected 1000 videos with an average duration of 5 seconds.

The next step is predicting the right class of the sound. For this they compared two approaches: a frame sequence network (FSLSTM) and a frame relation network (TRN). In the frame sequence approach they take each video frame and interpolate additional frames between the existing ones for more granularity. A ResNet-50 convolutional neural network (CNN) extracts image features from these frames. The image features are then fed to a recurrent neural network called Fast-Slow LSTM, which predicts the sound class. With the frame relation network they tried to capture the detailed transformations and actions of objects in less computational time. The frame relation network (or more precisely, the Multi-Scale Temporal Relation Network) compares features from frames that are N frames apart, where N takes on multiple values. Finally, the features from all of these scales are combined using a multilayer perceptron.
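To make the multi-scale idea more concrete, the sketch below pairs up per-frame feature vectors at several frame distances and pools them into one descriptor. It is only an illustration under assumptions, not the researchers' implementation: the function names, the choice of distances (1, 2, 4), and the mean pooling are all hypothetical stand-ins. In the article's description, learned layers and a multilayer perceptron do the combining, and the features come from a CNN rather than a toy list.

```python
from itertools import chain


def pairwise_relations(features, distance):
    """Concatenate the features of every frame pair that is `distance` frames apart."""
    return [features[t] + features[t + distance]
            for t in range(len(features) - distance)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def multi_scale_descriptor(features, distances=(1, 2, 4)):
    """Pool pair relations at several frame distances into one flat descriptor.

    Assumption: mean pooling stands in for the learned relation layers.
    """
    per_scale = []
    for distance in distances:
        pairs = pairwise_relations(features, distance)
        if pairs:  # skip distances longer than the clip
            per_scale.append(mean_vector(pairs))
    return list(chain.from_iterable(per_scale))


# Toy clip: 5 frames with 2 features each (stand-ins for CNN image features).
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
descriptor = multi_scale_descriptor(frames)
print(len(descriptor))  # 3 distances x (2 + 2) concatenated features
```

A classifier would then take a descriptor like this as input and output one of the 12 sound classes.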

The last step is generating the sound for the predicted class. To do this, the researchers used the Inverse Short Time Fourier Transform method. They first compute the average of all spectrograms of each sound class in their training set. This average serves as a good starting point (an anchor) for sound generation. The neural network then only has to predict the difference (delta) from this average sound anchor for every sampling step of the sound.
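The anchor-plus-delta idea can be illustrated with a few lines of Python. This is a minimal sketch under assumptions: spectrograms are represented as plain nested lists (time frames by frequency bins), the helper names are hypothetical, and the delta is computed from a known target rather than predicted by a network. The inverse short-time Fourier transform that would turn the result into a waveform is left out.

```python
def class_anchor(spectrograms):
    """Element-wise average of equally shaped spectrograms (frames x bins)."""
    count = len(spectrograms)
    return [[sum(values) / count for values in zip(*rows)]
            for rows in zip(*spectrograms)]


def delta_to_anchor(spectrogram, anchor):
    """Residual a model would have to predict: target minus class anchor."""
    return [[s - a for s, a in zip(s_row, a_row)]
            for s_row, a_row in zip(spectrogram, anchor)]


def apply_delta(anchor, delta):
    """Reconstruct a spectrogram from the class anchor and a (predicted) delta."""
    return [[a + d for a, d in zip(a_row, d_row)]
            for a_row, d_row in zip(anchor, delta)]


# Two toy training spectrograms of one sound class: 2 time frames x 3 frequency bins.
spec_a = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
spec_b = [[3.0, 2.0, 1.0], [4.0, 0.0, 2.0]]

anchor = class_anchor([spec_a, spec_b])
delta = delta_to_anchor(spec_a, anchor)   # here computed, not predicted
restored = apply_delta(anchor, delta)     # would be passed to an inverse STFT
print(anchor)
```

Because the anchor already carries the typical shape of the class, the values left for a model to predict are small corrections rather than a full spectrogram.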

The researchers used four different methods to evaluate the performance of the algorithm, one of which was a qualitative evaluation by humans. They asked local college students to pick the most realistic sound, the most suitable sound, the one with the least noise, and the most synchronized sound sample. These students preferred the synthesized sound over the original sound in 73.71 percent of the cases for one model, and in 65.96 percent of the cases for the other model. The preference for each model also depended on the content of the video: one model performed better on scenes with many random action changes.

You can judge for yourself whether the final result feels realistic with this video of a fire, this video of a horse, and this video of rain. You can read more about their approach in their paper.
