Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

CVPR 2025

¹University of Illinois Urbana-Champaign
²Sony AI
³Sony Group Corporation

TL;DR

MMAudio generates synchronized audio given video and/or text inputs.

Demo