Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

arXiv 2024

¹University of Illinois Urbana-Champaign
²Sony AI
³Sony Group Corporation




TL;DR

MMAudio generates synchronized audio given video and/or text inputs.




Demo