XNect Demo (v2): Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera

CVPR 2019, Long Beach, USA

Abstract

We present a real-time multi-person 3D human body pose estimation system that uses a single RGB camera for human motion capture in general scenes. Our learning-based approach gives full-body 3D articulation estimates even under strong partial occlusion, as well as camera-relative localization of each subject in space. The approach uses a three-stage design. The first stage processes the complete input frame at each time step with a novel, efficient convolutional network architecture, predicting the 2D body joint locations, their associations into individuals, and an intermediate 3D representation per body part for all subjects. The second stage uses a fully connected network to predict each subject's 3D pose from the predicted 2D pose and the intermediate 3D representation. A skeleton fitting step further reconciles the 2D and 3D predictions and, together with floor-plane calibration, can be used to accurately localize subjects in the scene. Our approach is trained on our recently proposed Multi-person Composited 3D Human Pose (MuCo-3DHP) dataset, and also leverages the MS-COCO person keypoints dataset for improved performance in general scenes. The system handles an arbitrary number of people in the scene and processes complete frames without requiring prior person detection.
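The three-stage pipeline described above can be sketched roughly as follows. This is a minimal structural illustration only: all class, function, and parameter names (e.g. `stage1_convnet`, `PersonEstimate`, `NUM_JOINTS`) are hypothetical assumptions, not the authors' actual API, and the stage bodies are placeholders for the real networks.

```python
# Hypothetical sketch of a three-stage, full-frame multi-person 3D pose
# pipeline as described in the abstract. Names and shapes are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

NUM_JOINTS = 17  # assumed joint count; the actual skeleton may differ


@dataclass
class PersonEstimate:
    joints_2d: np.ndarray                  # (NUM_JOINTS, 2) pixel coordinates
    features_3d: np.ndarray                # intermediate per-body-part 3D representation
    pose_3d: Optional[np.ndarray] = None   # (NUM_JOINTS, 3) 3D pose, filled by later stages


def stage1_convnet(frame: np.ndarray) -> List[PersonEstimate]:
    """Stage 1: a single full-frame CNN pass predicts 2D joint locations,
    their associations into individuals, and intermediate 3D features
    for all subjects at once (no prior person detector)."""
    # Placeholder: a real implementation runs the efficient conv net here.
    return []


def stage2_lift(person: PersonEstimate) -> np.ndarray:
    """Stage 2: a small fully connected network maps one subject's 2D pose
    and intermediate 3D representation to a full-body 3D pose."""
    # Placeholder for the per-subject lifting network.
    return np.zeros((NUM_JOINTS, 3))


def stage3_fit(person: PersonEstimate) -> np.ndarray:
    """Stage 3: skeleton fitting reconciles the 2D and 3D predictions;
    with floor-plane calibration it also localizes the subject."""
    # Placeholder for the kinematic fitting step.
    return person.pose_3d


def process_frame(frame: np.ndarray) -> List[PersonEstimate]:
    people = stage1_convnet(frame)      # whole frame, all subjects jointly
    for p in people:
        p.pose_3d = stage2_lift(p)      # per-subject 2D-to-3D lifting
        p.pose_3d = stage3_fit(p)       # skeleton fitting / reconciliation
    return people
```

The key design choice this sketch mirrors is that stage 1 runs once per frame regardless of the number of people, while stages 2 and 3 are lightweight and run per subject, which is what keeps the system real-time for arbitrarily many people.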

Real-time Demo Examples

Contact

Dushyant Mehta
dmehta@mpi-inf.mpg.de