An Overview of Visual Sound Synthesis Generation Tasks Based on Deep Learning Networks
DOI:
https://doi.org/10.62051/acf99a49
Keywords:
AI generated content; Video onomatopoeia synthesis; Automatic sound synthesis.
Abstract
Visual sound synthesis, the task of recreating, as realistically as possible, the sound produced by the movements and actions of objects in a video, given conditions such as the video content and accompanying text, is an important part of producing high-quality films today. Most traditional approaches synthesize sound effects manually, using physical props and constructed scenes to simulate the desired sounds. Such methods cannot easily satisfy arbitrary synthesis conditions and demand large amounts of manpower, material resources, and time; simulating realistic sound effects for a one-minute video can take nearly ten hours. In this paper, we systematically summarize and consolidate current advances in deep learning for visual sound synthesis based on the existing literature. We trace the exploration and development of deep learning models for this task, and categorize the research methods and associated datasets according to their development characteristics. By analyzing the technical differences among the various model approaches, we identify potential research directions in the field, thereby further promoting the rapid development and practical deployment of deep learning models in the video domain.
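The surveyed models differ widely in architecture, but most share the basic setup described above: a generative network is conditioned on visual (and sometimes textual) input and predicts an audio representation such as a spectrogram, which a vocoder then converts to a waveform. The sketch below is a minimal illustration of that setup, not a method from any of the surveyed papers; the module names, layer sizes, and the frames-to-mel-spectrogram interface are assumptions chosen for clarity, and a real system would add a vocoder and a training objective.

```python
# Minimal illustrative sketch (an assumption, not a surveyed model):
# a network maps a sequence of video frames to a mel-spectrogram.
import torch
import torch.nn as nn


class VideoToSpectrogram(nn.Module):
    """Toy video-conditioned sound generator: frames -> mel-spectrogram."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Per-frame visual encoder (a tiny CNN standing in for e.g. ResNet features).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        # Temporal model over the frame sequence.
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        # Project each time step to one mel-spectrogram column.
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, height, width)
        b, t, c, h, w = frames.shape
        feats = self.frame_encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        feats, _ = self.temporal(feats)
        return self.to_mel(feats)  # (batch, time, n_mels)


if __name__ == "__main__":
    model = VideoToSpectrogram()
    dummy_clip = torch.randn(2, 16, 3, 112, 112)  # 2 clips of 16 frames each
    mel = model(dummy_clip)
    print(mel.shape)  # torch.Size([2, 16, 80])
```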
References
Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman. "Visually Indicated Sounds." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba. "Ambient Sound Provides Supervision for Visual Learning." In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, Chenliang Xu. "Deep Cross-Modal Audio-Visual Generation." In Proceedings of the ACM International Conference on Multimedia (ACM MM), 2017.
Wangli Hao, Zhaoxiang Zhang, He Guan. "A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation." In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.
Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, Tamara L. Berg. "Visual to Sound: Generating Natural Sound for Videos in the Wild." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Kan Chen, Chuanxi Zhang, Chen Fang, Zhaowen Wang, Trung Bui, Ram Nevatia. "Visually Indicated Sound Generation by Perceptually Optimized Classification." In Proceedings of the Multimodal Learning and Applications Workshop (MULA), 2018.
Hang Zhou, Ziwei Liu, Xudong Xu, Ping Luo, Xiaogang Wang. "Vision-Infused Deep Audio Inpainting." In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba. "Foley Music: Learning to Generate Music from Videos." In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
Huadong Tan, Guang Wu, Pengcheng Zhao, Yanxiang Chen. "Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman. "Sight to Sound: An End-to-End Approach for Visual Piano Transcription." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan. "Generating Visually Aligned Sound from Videos." IEEE, 2020.
Sanchita Ghose, John J. Prevost. "AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos with Deep Learning." IEEE, 2020.
Kun Su, Xiulong Liu, Eli Shlizerman. "Audeo: Audio Generation for a Silent Performance Video." In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2020.
Sanchita Ghose, John J. Prevost. "Enabling an IoT System of Systems through Auto Sound Synthesis in Silent Video with DNN." In Proceedings of the IEEE 15th International Conference of System of Systems Engineering (SoSE), 2020.
Katashi Nagao, Kaho Kumon, Kodai Hattori. "Impact Sound Generation for Audiovisual Interaction with Real-World Movable Objects in Building-Scale Virtual Reality." Applied Sciences, 2021.
Vladimir Iashin, Esa Rahtu. "Taming Visually Guided Sound Generation." In Proceedings of the British Machine Vision Conference (BMVC), 2021.
Sanchita Ghose, John J. Prevost. "FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos." IEEE, 2021.
Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens. "Conditional Generation of Audio from Video via Foley Analogies." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo. "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.