MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis

Demonstration of MoFusion Framework

Figure 1: Our MoFusion approach synthesises long sequences of human motions in 3D from textual and audio inputs (e.g., by providing music samples; see the rightmost examples). Our model has significantly improved generalisability and realism, and generates longer sequences than previous methods (#N denotes the number of generated frames for each demonstrated motion). Moreover, the resulting dance movements match the rhythm of the conditioning music, even if the latter is outside the training distribution.

Abstract

Conventional methods for human motion synthesis have either been deterministic or have struggled with the trade-off between motion diversity and motion quality. In response to these limitations, we introduce MoFusion, a new denoising-diffusion-based framework for high-quality conditional human motion synthesis that can generate long, temporally plausible, and semantically accurate motions from a range of conditioning contexts (such as music and text). We also present ways to introduce well-known kinematic losses for motion plausibility within the motion-diffusion framework through our scheduled weighting strategy. The learned latent space can be used for several interactive motion-editing applications, such as in-betweening, seed conditioning, and text-based editing, thus providing crucial abilities for virtual-character animation and robotics. Through comprehensive quantitative evaluations and a perceptual user study, we demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature. We urge the reader to watch our supplementary video.
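To make the idea of the scheduled weighting strategy concrete, the following is a minimal sketch of how time-dependent weights can fold a kinematic term into a diffusion training step. All names (the denoiser `model`, the linear noise schedule, the joint-velocity loss) are illustrative assumptions for this sketch, not the paper's exact formulation.

    # Hypothetical sketch of a time-scheduled kinematic loss in a motion-diffusion
    # training step (names, schedule, and loss choice are illustrative assumptions).
    import torch
    import torch.nn.functional as F

    T = 1000                                   # number of diffusion steps (assumed)
    betas = torch.linspace(1e-4, 2e-2, T)      # linear noise schedule (assumed)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)

    def training_step(model, x0, cond, t):
        """x0: clean motion (B, frames, dim); cond: conditioning embedding; t: (B,) steps."""
        a_t = alphas_cum[t].view(-1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise    # forward diffusion q(x_t | x0)

        x0_hat = model(x_t, t, cond)                           # denoiser predicts the clean motion

        # Standard reconstruction loss in data space.
        loss_rec = F.mse_loss(x0_hat, x0)

        # Kinematic term on the predicted clean motion, e.g. joint-velocity consistency
        # (bone-length or foot-contact terms could be added analogously).
        vel_hat = x0_hat[:, 1:] - x0_hat[:, :-1]
        vel_gt = x0[:, 1:] - x0[:, :-1]
        loss_vel = ((vel_hat - vel_gt) ** 2).mean(dim=(1, 2))  # per-sample loss, shape (B,)

        # Scheduled weight: close to 1 at low noise (small t), close to 0 at high noise,
        # so the kinematic constraint only acts once the prediction resembles a valid motion.
        w_t = alphas_cum[t]
        return loss_rec + (w_t * loss_vel).mean()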

Video

Download Video: HD (MP4, 68 MB)

Reverse Diffusion Process for Human Motion Synthesis

Download Video: HD
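The video above shows motions emerging from noise over the course of the reverse diffusion process. As a rough illustration of the mechanism, below is a minimal, generic DDPM-style sampling loop for a motion denoiser that predicts the clean motion at each step; the schedule, shapes, and variable names are assumptions for illustration and do not reproduce the exact MoFusion sampler.

    # Minimal generic DDPM-style reverse-diffusion sampler for motion, assuming a
    # denoiser that predicts the clean motion x0 at each step (illustrative sketch).
    import torch

    @torch.no_grad()
    def sample_motion(model, cond, n_frames, dim, T=1000, device="cpu"):
        betas = torch.linspace(1e-4, 2e-2, T, device=device)
        alphas = 1.0 - betas
        alphas_cum = torch.cumprod(alphas, dim=0)

        x = torch.randn(1, n_frames, dim, device=device)       # start from pure Gaussian noise
        for t in reversed(range(T)):
            t_batch = torch.full((1,), t, device=device, dtype=torch.long)
            x0_hat = model(x, t_batch, cond)                    # predicted clean motion

            # Mean of the DDPM posterior q(x_{t-1} | x_t, x0_hat).
            a_t = alphas_cum[t]
            a_prev = alphas_cum[t - 1] if t > 0 else torch.tensor(1.0, device=device)
            mean = (a_prev.sqrt() * betas[t] / (1.0 - a_t)) * x0_hat \
                 + (alphas[t].sqrt() * (1.0 - a_prev) / (1.0 - a_t)) * x

            if t > 0:
                var = betas[t] * (1.0 - a_prev) / (1.0 - a_t)   # posterior variance
                x = mean + var.sqrt() * torch.randn_like(x)
            else:
                x = mean
        return x                                                # synthesised motion (1, n_frames, dim)

The conditioning signal (a text or music embedding) enters only through cond; the loop itself follows the standard denoising schedule.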

Text-to-Motion Generation

Results

Download Video: HD

Music-to-Dance Generation

Results

Please unmute the audio to hear corresponding music.

Download Video: HD

Seed Conditioned Motion Forecasting

Please unmute the audio to hear corresponding music.

Download Video: HD
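A common way to realise seed conditioning (and, analogously, in-betweening) with a diffusion model is imputation during sampling: at every reverse step, the seed frames are overwritten with a correspondingly noised copy of the known motion, so the denoiser only has to complete the remaining frames. The sketch below illustrates this idea on top of the sampler shown earlier; it is an assumed mechanism for illustration, not necessarily MoFusion's exact conditioning scheme.

    # Illustrative imputation-style seed conditioning (assumed mechanism, not
    # necessarily MoFusion's exact scheme); meant to be called inside the reverse
    # loop, right after each update of x.
    import torch

    def impose_seed(x_t, t, seed, seed_mask, alphas_cum):
        """Pin the seed frames of x_t to the known motion, noised to level t.

        x_t:       (1, frames, dim) current noisy sample
        seed:      (1, frames, dim) known motion (valid only where seed_mask == 1)
        seed_mask: (1, frames, 1)   1 for observed seed frames, 0 for frames to generate
        """
        a_t = alphas_cum[t]
        seed_noised = a_t.sqrt() * seed + (1.0 - a_t).sqrt() * torch.randn_like(seed)
        return seed_mask * seed_noised + (1.0 - seed_mask) * x_t

Calling impose_seed after each update of x keeps the observed frames pinned to the seed motion while the remaining frames are synthesised.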

Quality Comparison with State-of-the-art

In Music-to-Dance Generation, we observe better perceptual quality on unseen music than both the ground-truth data and the state of the art, despite a higher FID. This shows that a lower FID does not necessarily correspond to better synthesis quality, as the examples below illustrate.

Please unmute the audio to hear corresponding music.

Download Video: HD
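As background on the metric discussed above: FID compares the Gaussian statistics (mean and covariance) of feature embeddings extracted from real and generated motions, so it measures closeness to the training distribution rather than per-sample plausibility, which is why perceptual preference and FID can disagree. Below is a minimal sketch of the metric, assuming a motion feature extractor already exists (not shown here).

    # Minimal FID between real and generated motion features (NumPy/SciPy sketch;
    # the motion feature extractor is assumed to exist and is not shown).
    import numpy as np
    from scipy import linalg

    def fid(feats_real, feats_gen):
        """feats_*: (N, D) arrays of feature embeddings."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_g = np.cov(feats_gen, rowvar=False)

        # Frechet distance between the Gaussians N(mu_r, cov_r) and N(mu_g, cov_g).
        covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))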

Downloads


Citation

BibTeX, 1 KB

@InProceedings{dabral2022mofusion,
      title={MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis},
      author={Rishabh Dabral and Muhammad Hamza Mughal and Vladislav Golyanik and Christian Theobalt},
      booktitle={Computer Vision and Pattern Recognition (CVPR)},
      year={2023}
}

Contact

For questions or clarifications, please get in touch with:
Rishabh Dabral
rdabral@mpi-inf.mpg.de
Vladislav Golyanik
golyanik@mpi-inf.mpg.de
