MIBURI: Towards Expressive Interactive Gesture Synthesis

TL;DR: Causal, real-time generation of expressive full-body gestures for interactive Embodied Conversational Agents.


Real-time Demonstrations

↓ MIBURI performing expressive online gesture generation ↓


Abstract

Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part-aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal, real-time approach produces natural and contextually aligned gestures in comparison with recent baselines. We encourage the reader to explore the demo and explanation videos.

MIBURI Introduction. Please unmute to follow Narration.

Method Explanation

Our approach presents an online, causal framework for generating expressive full-body gestures and facial expressions in real time. Instead of relying on complex multi-stage animation pipelines, MIBURI directly produces co-speech gestures using semantic and acoustic tokens from a speech-text foundation model. With body-part-aware codecs and a two-dimensional autoregressive scheme, it captures both temporal dynamics and hierarchical motion details.
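To make the two-dimensional autoregressive idea concrete, the following is a minimal sketch, not the MIBURI implementation. It assumes each motion frame is represented by a fixed number of discrete token slots (one per body part and codec level) and that a model predicts each slot from three inputs: the current speech embedding, all previously generated frames, and the slots already filled in the current frame. The part names, `LEVELS_PER_PART`, `CODEBOOK_SIZE`, `toy_predictor`, and `generate_stream` are all hypothetical names chosen for illustration; the random predictor stands in for a learned network.

```python
import random
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical layout: part names and sizes are illustrative, not from the paper.
PARTS = ("upper_body", "hands", "lower_body", "face")
LEVELS_PER_PART = 2
CODEBOOK_SIZE = 8
SLOTS_PER_FRAME = len(PARTS) * LEVELS_PER_PART

Frame = List[int]  # one discrete token per (part, level) slot, in a fixed order
Predictor = Callable[[Sequence[float], List[Frame], Frame, random.Random], int]


def toy_predictor(speech_emb: Sequence[float], past_frames: List[Frame],
                  current_prefix: Frame, rng: random.Random) -> int:
    """Stand-in for a learned model: returns a random token id.

    A real model would score tokens using the speech embedding, the past
    frames, and the slots already generated for the current frame.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate_stream(speech_stream: Iterable[Sequence[float]],
                    predictor: Predictor, seed: int = 0) -> Iterator[Frame]:
    """Yield one token frame per incoming speech embedding, causally."""
    rng = random.Random(seed)
    past: List[Frame] = []
    for emb in speech_stream:  # axis 1: time, using past and current input only
        frame: Frame = []
        for _ in range(SLOTS_PER_FRAME):  # axis 2: part/level slots in a frame
            frame.append(predictor(emb, past, frame, rng))
        past.append(frame)
        yield list(frame)  # a frame is emitted before the next input is read


if __name__ == "__main__":
    dummy_speech = [[0.0, 0.1]] * 3  # three placeholder speech embeddings
    for t, frame in enumerate(generate_stream(dummy_speech, toy_predictor)):
        print(t, frame)
```

Because the generator emits a complete frame before it reads the next speech embedding, no future speech context is ever consulted, which is the property that distinguishes causal generation from the offline baselines described above.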


Online Generation Results

↓ More results for causal, real-time generation in the demo ↓


Offline Generation Results on BEAT2

We also demonstrate our method's capability to perform offline gesture synthesis on the BEAT2 multispeaker test set.


Qualitative Comparison with Non-Causal Baselines


Citation

@InProceedings{mughal2026miburi,
	title = {MIBURI: Towards Expressive Interactive Gesture Synthesis},
	author = {M. Hamza Mughal and Rishabh Dabral and Vera Demberg and Christian Theobalt},
	booktitle = {Computer Vision and Pattern Recognition (CVPR)},
	year = {2026}
}

Acknowledgement

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), GRK 2853/1 “Neuroexplicit Models of Language, Vision, and Action”, project number 471607914. We also thank Anton Zubekhin and Andrea Boscolo Camiletto for their help with the demo.

Contact

For questions or clarifications, please get in touch with:
M. Hamza Mughal (mmughal-(at)-mpi-inf.mpg.de)