Abstract

As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.

Paper

Visual to Sound: Generating Natural Sound for Videos in the Wild
Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui and Tamara L. Berg
[Preprint]

@article{v2s,
   journal   = {CoRR},
   year      = {2017},
   author    = {Yipin Zhou and Zhaowen Wang and Chen Fang and Trung Bui and Tamara L. Berg},
   title     = {Visual to Sound: Generating Natural Sound for Videos in the Wild},
}

Dataset

Visually Engaged and Grounded AudioSet (VEGAS) [ZIP] [ANNOTATION] (forthcoming)
Please email yipin@cs.unc.edu if there are any problems.

Results

Can you tell which audio is generated?

Generated Sound          Real Sound
Real Sound               Generated Sound

Real Sound               Generated Sound
Generated Sound          Real Sound

Generated Sound          Real Sound
Generated Sound          Real Sound

Real Sound               Generated Sound
Generated Sound          Real Sound

Real Sound               Generated Sound
Generated Sound          Real Sound

Generated Sound          Real Sound
Real Sound               Generated Sound

Generated Sound          Real Sound
Generated Sound          Real Sound

Real Sound               Generated Sound
Generated Sound          Real Sound

Real Sound               Generated Sound
Real Sound               Generated Sound

Generated Sound          Real Sound
Real Sound               Generated Sound