Immersive VR telepresence ideally means being able to interact and communicate with digital avatars that are indistinguishable from, and precisely reflect the behaviour of, their real counterparts. The core technical challenge is twofold: creating a digital double that faithfully reflects the real human, and tracking the real human solely from egocentric sensing devices that are lightweight and power-efficient, e.g. a single RGB camera. To date, no unified solution to this problem exists, as recent works either focus solely on egocentric motion capture, only model the head, or build avatars from multi-view captures. In this work, we propose, for the first time in the literature, a person-specific egocentric telepresence approach, which jointly models the photoreal digital avatar and drives it from a single egocentric video. We first present a character model that is animatable, i.e. it can be driven solely by skeletal motion, while modeling both geometry and appearance. Then, we introduce a personalized egocentric motion capture component, which recovers full-body motion from an egocentric video. Finally, we apply the recovered pose to our character model and perform a test-time mesh refinement such that the geometry faithfully projects onto the egocentric view. To validate our design choices, we propose a new and challenging benchmark, which provides paired egocentric and dense multi-view videos of real humans performing various motions. Our experiments demonstrate a clear step towards egocentric and photoreal telepresence, as our method outperforms baselines as well as competing methods.
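To make the test-time mesh refinement idea concrete, the following is a minimal sketch, not the authors' implementation: per-vertex offsets are optimized by gradient descent so that the posed geometry projects onto the egocentric view. The camera convention, the 2D supervision signal, the regularization weight, and all tensor shapes are assumptions made for this illustration.

```python
# Hedged sketch of test-time mesh refinement against a single egocentric view.
# All interfaces below (inputs, camera model, loss terms) are illustrative assumptions.
import torch

def refine_mesh(vertices, K, R, t, target_2d, vis_mask, steps=50, lr=1e-3):
    """Optimize per-vertex offsets so the projected geometry matches 2D evidence.

    vertices : (V, 3) posed template vertices in world space
    K        : (3, 3) egocentric camera intrinsics
    R, t     : (3, 3), (3,) world-to-camera rotation and translation
    target_2d: (V, 2) target image-plane positions (e.g. from correspondences)
    vis_mask : (V,) 1 for vertices visible in the egocentric view, else 0
    """
    offsets = torch.zeros_like(vertices, requires_grad=True)
    opt = torch.optim.Adam([offsets], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        v = vertices + offsets
        cam = v @ R.T + t                        # world -> camera coordinates
        proj = cam @ K.T                         # pinhole projection
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
        data = ((uv - target_2d) ** 2).sum(-1)   # 2D alignment term per vertex
        reg = (offsets ** 2).mean()              # keep offsets small
        loss = (vis_mask * data).mean() + 1e2 * reg
        loss.backward()
        opt.step()
    return (vertices + offsets).detach()
```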
Taking a single egocentric RGB video as input, we first detect the skeletal pose in the form of 3D keypoints and then solve for the skeleton parameters, i.e. joint angles, using our IKSolver. The motion signal drives the mesh-based avatar via our MotionDeformer, which is pre-trained on multi-view videos of the actor performing various motions. At inference time, our EgoDeformer further improves the alignment of the predicted avatar with the egocentric view. Finally, our GaussianPredictor generates dynamic Gaussian parameters in the UV space of the character's mesh, which model the motion- and view-dependent appearance of the avatar. Given the recovered Gaussian parameters representing our character, we can render free-viewpoint videos of the avatar via Gaussian splatting, driven solely by an egocentric RGB video of the real human.
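The inference-time data flow above can be summarized in code. The sketch below is an illustration only: the module interfaces (constructor arguments, tensor shapes, and the exact inputs of IKSolver, MotionDeformer, EgoDeformer, and GaussianPredictor) are assumptions, as the actual components are learned networks described in the paper.

```python
# Hedged sketch of the inference pipeline; stage interfaces are assumptions.
import torch.nn as nn

class EgoAvatarPipeline(nn.Module):
    def __init__(self, keypoint_net, ik_solver, motion_deformer,
                 ego_deformer, gaussian_predictor, renderer):
        super().__init__()
        self.keypoint_net = keypoint_net              # egocentric frame -> 3D keypoints
        self.ik_solver = ik_solver                    # 3D keypoints -> joint angles
        self.motion_deformer = motion_deformer        # joint angles -> posed template mesh
        self.ego_deformer = ego_deformer              # test-time egocentric-view alignment
        self.gaussian_predictor = gaussian_predictor  # mesh + motion -> UV-space Gaussians
        self.renderer = renderer                      # Gaussian splatting to a target camera

    def forward(self, ego_frame, target_camera):
        kpts_3d = self.keypoint_net(ego_frame)           # (J, 3) skeletal keypoints
        pose = self.ik_solver(kpts_3d)                   # skeleton parameters (joint angles)
        mesh = self.motion_deformer(pose)                # animatable geometry
        mesh = self.ego_deformer(mesh, ego_frame)        # refine geometry to the ego view
        gaussians = self.gaussian_predictor(mesh, pose)  # dynamic Gaussian parameters
        return self.renderer(gaussians, target_camera)   # free-viewpoint rendering
```

Predicting the Gaussian parameters in the UV space of the character's mesh ties the appearance representation to the animatable geometry, so the same skeletal motion drives both shape and appearance.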
@inproceedings{chen2024egoavatar,
  title     = {EgoAvatar: Egocentric View-Driven and Photorealistic Full-body Avatars},
  author    = {Chen, Jianchun and Wang, Jian and Zhang, Yinda and Pandey, Rohit and Beeler, Thabo and Habermann, Marc and Theobalt, Christian},
  booktitle = {SIGGRAPH Asia 2024 Conference Papers},
  pages     = {1--11},
  year      = {2024}
}
This project was supported by the ERC Consolidator Grant 4DReply (770784) and the Saarbrücken Research Center for Visual Computing, Interaction, and AI. We would like to thank the anonymous reviewers for their constructive comments and suggestions, and Guoxing Sun for his help in implementing forward/inverse kinematics.