FRAME

Floor-aligned Representation for Avatar Motion from Egocentric Video

CVPR 2025 (Highlight)

1Max Planck Institute for Informatics, Saarland Informatics Campus

2Saarbrücken Research Center for Visual Computing, Interaction and AI

3Google, Switzerland

TLDR: We leverage VR headset 6D tracking to improve motion capture from body-facing cameras.

Interactive Data Samples

Archery
Push-Ups
Boxing
Sit

Summary Video

Abstract

Egocentric motion capture with a head-mounted, body-facing stereo camera is crucial for VR and AR applications but presents significant challenges, such as heavy occlusions and limited annotated real-world data. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for the lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset to date for ego-facing, ego-mounted cameras, in both size and motion variability. Effectively integrating this multimodal input (device pose and camera feeds) is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration, and that can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model's generalization capabilities. Our approach exploits the problem's geometric properties, yielding high-quality motion capture free from artifacts common in prior works. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method.

BibTeX citation

    @inproceedings{boscolo2025frame,
      title     = {FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video},
      author    = {Boscolo Camiletto, Andrea and Wang, Jian and Alvarado, Eduardo and Dabral, Rishabh and Beeler, Thabo and Habermann, Marc and Theobalt, Christian},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year      = {2025},
    }