Pipeline
Figure 2: Given a skeletal motion and virtual camera view as input, our method generates highly realistic renderings of the human under the specified pose and view.
To this end, we first regress a coarse, motion-dependent deforming human mesh.
From the deformed mesh, we extract several motion features in texture space, which are then passed through a 3D-aware convolutional architecture to generate a motion-conditioned feature tri-plane.
Ray samples in global space are mapped into a 3D texture cube; the resulting coordinates are then used to sample a feature from the tri-plane.
This feature is then passed to a small MLP predicting color and density.
Finally, volume rendering and our proposed mesh optimization generate the geometry and images; the core sampling and compositing steps are sketched below.
Our method is supervised solely with multi-view imagery.
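To make the tri-plane lookup and decoding step concrete, the following minimal PyTorch sketch shows one common way such a lookup is implemented: each 3D point in the texture cube is projected onto three axis-aligned feature planes, the planes are bilinearly sampled, and the aggregated feature is decoded by a small MLP into color and density. The function names, tensor shapes, feature dimension, and plane ordering here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """Sample features for 3D points from a feature tri-plane.

    planes: (3, C, H, W) feature planes (assumed xy, xz, yz ordering).
    pts:    (N, 3) point coordinates in the texture cube, in [-1, 1].
    Returns (N, C) features, summed over the three plane samples.
    """
    # Project each point onto the three axis-aligned planes.
    coords = torch.stack([pts[:, [0, 1]],   # xy plane
                          pts[:, [0, 2]],   # xz plane
                          pts[:, [1, 2]]])  # yz plane -> (3, N, 2)
    # grid_sample expects a (B, H_out, W_out, 2) grid; use a 1 x N "image".
    grid = coords.unsqueeze(1)                        # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, mode='bilinear',
                          align_corners=True)         # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).t()            # (N, C)

class RadianceHead(nn.Module):
    """Small MLP decoding a tri-plane feature into RGB color and density."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))            # 3 color channels + 1 density
    def forward(self, feats):
        out = self.mlp(feats)
        rgb = torch.sigmoid(out[:, :3])      # colors constrained to [0, 1]
        sigma = F.softplus(out[:, 3])        # non-negative density
        return rgb, sigma
```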
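The compositing step then follows standard volume rendering along each camera ray. A minimal sketch, assuming per-ray sample colors, densities, and inter-sample distances have already been computed (shapes and names are again illustrative):

```python
def volume_render(rgb, sigma, deltas):
    """Composite per-sample colors along rays via volume rendering.

    rgb:    (R, S, 3) color at each of S samples along R rays.
    sigma:  (R, S)    density at each sample.
    deltas: (R, S)    distance between consecutive samples.
    Returns (R, 3) rendered pixel colors.
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # (R, S)
    # Accumulated transmittance: probability the ray reaches sample i.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]),
                       trans[:, :-1]], dim=-1)
    weights = alpha * trans                                  # (R, S)
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)          # (R, 3)
```

The same per-sample weights can also composite depth or expected surface position, which is one plausible hook for the mesh optimization mentioned in the caption.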