Method Explanation
Our approach, MIBURI, is an online, causal framework that generates expressive full-body gestures and facial expressions in real time. Rather than relying on a complex multi-stage animation pipeline, MIBURI generates co-speech gestures directly from the semantic and acoustic tokens of a speech-text foundation model. Body-part-aware codecs and a two-dimensional autoregressive scheme let it capture both temporal dynamics and hierarchical motion detail.
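
To make the two-dimensional autoregressive idea concrete, the following is a minimal sketch and not the MIBURI implementation. It assumes that one axis is time and the other is an ordered set of levels within each frame (for example, coarse-to-fine codec tokens); that reading of "two-dimensional" is an assumption, as are all names here (generate_streaming, toy_predict, Frame). A single integer per step stands in for the speech tokens, and a fixed arithmetic rule stands in for the learned network.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# All names below are hypothetical and chosen for illustration only.
Frame = List[int]  # one motion token per level for a single time step


def toy_predict(speech_token: int, past: Sequence[Frame],
                partial: Sequence[int], level: int, vocab: int = 8) -> int:
    """Stand-in for a learned network: a fixed, deterministic rule.

    It only reads the current speech token, past frames, and the levels
    already decoded in the current frame, so it is causal by construction.
    """
    prev = past[-1][level] if past else 0
    return (speech_token + prev + sum(partial) + level) % vocab


def generate_streaming(speech_tokens: Iterable[int], num_levels: int,
                       predict: Callable[..., int] = toy_predict
                       ) -> Iterator[Frame]:
    """Decode motion tokens along two axes: time, then level within a frame."""
    history: List[Frame] = []
    for s in speech_tokens:             # axis 1: time (never looks ahead)
        frame: Frame = []
        for level in range(num_levels):  # axis 2: levels inside the frame
            frame.append(predict(s, history, frame, level))
        history.append(frame)
        yield list(frame)               # emit as soon as the frame is complete


if __name__ == "__main__":
    for t, frame in enumerate(generate_streaming([1, 2, 3], num_levels=3)):
        print(t, frame)
```

Because each frame is yielded as soon as its last level is decoded and nothing depends on future speech tokens, the loop can run online; changing a later input leaves earlier frames unchanged.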

