Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis


¹Max Planck Institute for Informatics, SIC · ²Saarland University · ³Utrecht University

TL;DR: Our RAG-Gesture approach produces semantically meaningful co-speech gestures via retrieval-augmented, diffusion-based generation.


Abstract

Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. We therefore present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, these semantic exemplars are injected into our diffusion-based generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a guidance control paradigm that allows users to modulate how much influence each retrieved exemplar has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation methods. The reader is urged to watch the supplementary video.

Method Overview


Method Explanation

Retrieval Algorithms

Our approach uses retrieval algorithms to fetch exemplar gestures from a database of co-speech gestures. The narrated video below explains how they work; a simplified code sketch of the idea follows.
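To make the idea concrete, here is a minimal, non-authoritative sketch of linguistically grounded retrieval. It assumes a database of gesture clips paired with transcripts and precomputed sentence embeddings; the GestureEntry layout, the CONNECTIVES list, and the connective score bonus are illustrative assumptions, not the exact criteria used in RAG-Gesture.

# Hypothetical sketch of linguistically grounded exemplar retrieval.
# The database layout, CONNECTIVES list, and scoring are illustrative
# assumptions, not the implementation used in RAG-Gesture.
from dataclasses import dataclass

import numpy as np


@dataclass
class GestureEntry:
    transcript: str             # speech segment paired with the gesture
    text_embedding: np.ndarray  # precomputed sentence embedding
    motion: np.ndarray          # gesture clip, e.g. (frames, joints, 3)


# Example discourse connectives that often co-occur with semantic gestures.
CONNECTIVES = {"but", "because", "however", "so", "although"}


def retrieve_exemplars(query: str, query_emb: np.ndarray,
                       database: list[GestureEntry], k: int = 3):
    """Return the k database gestures whose speech best matches the query.

    The score is cosine similarity between sentence embeddings, plus a
    small bonus when the query and the entry's transcript share a
    discourse connective.
    """
    query_words = set(query.lower().split())
    scored = []
    for entry in database:
        sim = float(query_emb @ entry.text_embedding /
                    (np.linalg.norm(query_emb) *
                     np.linalg.norm(entry.text_embedding) + 1e-8))
        shared = query_words & CONNECTIVES & set(entry.transcript.lower().split())
        scored.append((sim + 0.1 * len(shared), entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]

In this sketch, semantic similarity does the bulk of the ranking, while the connective overlap stands in for the explicit, interpretable discourse knowledge mentioned in the abstract.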


Inference-time Semantic Guidance Mechanism

Our approach injects semantic exemplar gestures into the diffusion-based gesture generation pipeline through latent initialization via DDIM inversion and through retrieval guidance. The following video explains the process; a simplified sketch of both mechanisms is given below.
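As a rough, non-authoritative sketch of the two mechanisms, the snippet below shows (i) DDIM inversion, which maps a retrieved clip back to a noisy latent so generation can start from a latent that already encodes it, and (ii) a simple blend-style guidance term whose weight controls the exemplar's influence. The function signatures, denoiser, alphas_cumprod, mask, and the weight w are illustrative assumptions rather than the paper's exact interfaces.

# Hedged sketch of DDIM inversion + retrieval guidance for a motion
# diffusion model. denoiser, alphas_cumprod, mask, and w are
# illustrative assumptions, not the paper's exact interfaces.
import torch


@torch.no_grad()
def ddim_invert(x0, denoiser, cond, alphas_cumprod, timesteps):
    """Map a retrieved gesture clip x0 back to a noisy latent x_T,
    so generation can start from a latent that already encodes it.

    timesteps is an increasing schedule from 0 to T; the loop runs the
    deterministic DDIM update in reverse (re-noising one step at a time).
    """
    x = x0
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):   # 0 -> T
        a_prev, a = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = denoiser(x, t_prev, cond)                     # predicted noise
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a.sqrt() * x0_pred + (1 - a).sqrt() * eps       # re-noise one step
    return x


def retrieval_guidance(x0_pred, exemplar, mask, w):
    """Pull the denoised prediction toward the retrieved exemplar on the
    frames selected by mask; w in [0, 1] controls the influence."""
    return x0_pred + w * mask * (exemplar - x0_pred)

Here w plays the role of the user-facing influence control described in the abstract: w = 0 ignores the exemplar, while w = 1 copies it onto the masked frames.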


Visualizing Retrieval Augmented Generations

We identify semantically important sections of the speech and augment the generated gestures with retrieved exemplars (visualized in red).



Qualitative Comparisons


Ablations & Analysis


Citation

@article{mughal2024raggesture,
title={Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis}, 
author={M. Hamza Mughal and Rishabh Dabral and Merel C. J. Scholman and Vera Demberg and Christian Theobalt},
year={2024},
eprint={2412.06786},
archivePrefix={arXiv},
primaryClass={cs.CV},
}

Acknowledgement

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -- GRK 2853/1 “Neuroexplicit Models of Language, Vision, and Action” -- project number 471607914. The third author of this study (MS) was supported through the NWO-funded project “Searching for meaning: Identifying and interpreting alternative discourse relation signals in multi-modal language comprehension” (VI.Veni.231C.021).

Contact

For questions or clarifications, please get in touch with:
M. Hamza Mughal (mmughal-(at)-mpi-inf.mpg.de)