Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

Guoxing Sun, Rishabh Dabral, Heming Zhu, Pascal Fua, Christian Theobalt, Marc Habermann


1Max Planck Institute for Informatics, Saarland Informatics Campus

2EPFL

We propose Double Unprojected Textures (DUT), a new method to synthesize photoreal 4K novel-view renderings in real-time. Our method consistently outperforms baseline approaches in both rendering quality and inference speed. Moreover, it generalizes to both in-distribution (IND) motions, e.g., dancing, and out-of-distribution (OOD) motions, e.g., standing long jump.


Abstract

Real-time free-view human rendering from sparse-view RGB inputs is a challenging task due to sensor scarcity and the tight time budget. To ensure efficiency, recent methods leverage 2D CNNs operating in texture space to learn rendering primitives. However, they either jointly learn geometry and appearance, or completely ignore sparse image information for geometry estimation, significantly harming visual quality and robustness to unseen body poses. To address these issues, we present Double Unprojected Textures, which at its core disentangles coarse geometric deformation estimation from appearance synthesis, enabling robust and photorealistic 4K rendering in real-time. Specifically, we first introduce a novel image-conditioned template deformation network, which estimates the coarse deformation of the human template from a first unprojected texture. This updated geometry is then used to perform a second, more accurate texture unprojection. The resulting texture map has fewer artifacts and better alignment with input views, which benefits our learning of finer-level geometry and appearance represented by Gaussian splats. We validate the effectiveness and efficiency of the proposed method in quantitative and qualitative experiments, which show that it significantly surpasses other state-of-the-art methods.
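For readers curious what a single texture unprojection looks like in practice, below is a minimal PyTorch sketch of the operation the method applies twice: projecting each texel's 3D surface point into every camera and sampling the image color there. All tensor names and shapes (e.g., texel_xyz, texel_mask) are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def unproject_texture(images, texel_xyz, texel_mask, cam_proj):
    """Sketch: sample multi-view image colors into UV texture space.

    images:     (V, 3, H, W) sparse-view RGB inputs
    texel_xyz:  (T, T, 3)    3D surface point of each texel on the posed template
    texel_mask: (V, T, T)    per-view texel visibility (e.g., from a depth test)
    cam_proj:   (V, 3, 4)    camera projection matrices
    Returns a (V, 3, T, T) stack of partial textures, one per view.
    """
    V, _, H, W = images.shape
    T = texel_xyz.shape[0]
    # Homogeneous texel coordinates, flattened to (T*T, 4).
    xyz1 = torch.cat([texel_xyz.reshape(-1, 3),
                      texel_xyz.new_ones(T * T, 1)], dim=-1)
    textures = []
    for v in range(V):
        uvw = xyz1 @ cam_proj[v].T                          # (T*T, 3) projected points
        xy = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)       # perspective divide -> pixels
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([2.0 * xy[:, 0] / (W - 1) - 1.0,
                            2.0 * xy[:, 1] / (H - 1) - 1.0], dim=-1)
        tex = F.grid_sample(images[v:v + 1], grid.view(1, T, T, 2),
                            align_corners=True)             # (1, 3, T, T)
        textures.append(tex[0] * texel_mask[v])             # zero out occluded texels
    return torch.stack(textures)
```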

Main Video

Method

Given sparse-view images and the corresponding skeletal motion, DUT predicts coarse template geometry and fine-grained 3D Gaussians. We first unproject the images onto the posed template to obtain a texture map, which is fed into GeoNet to estimate deformations of the template in canonical pose. We then unproject the images again onto the posed and deformed template to obtain a less-distorted texture map, which serves as input to our GauNet for estimating 3D Gaussian parameters; the predicted Gaussians undergo scale refinement before splatting. A sketch of this two-stage forward pass follows.
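The sketch below organizes the pipeline described above into pseudocode under our own naming assumptions (template, geonet, gaunet, splatter, and refine_scales are illustrative placeholders, and unproject_texture refers to the earlier snippet); it is a reading aid, not the released implementation.

```python
import torch

def dut_forward(images, cams, motion, template, geonet, gaunet, splatter):
    """Hedged sketch of the two-stage DUT forward pass."""
    # Stage 1: coarse geometry from a first, distorted unprojection.
    posed = template.pose(motion)                        # skeletal posing of the template
    tex1 = unproject_texture(images, posed.texel_xyz,    # first unprojected texture
                             posed.texel_mask, cams)
    delta = geonet(tex1)                                 # 2D CNN: canonical-pose deformation
    deformed = template.deform(delta).pose(motion)       # deformed, re-posed template

    # Stage 2: appearance from a second, better-aligned unprojection.
    tex2 = unproject_texture(images, deformed.texel_xyz,
                             deformed.texel_mask, cams)
    gaussians = gaunet(tex2)                             # 2D CNN: per-texel Gaussian parameters
    gaussians = refine_scales(gaussians, deformed)       # scale refinement before splatting
    return splatter(gaussians)                           # rasterize to the target view
```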

Out-of-distribution Motion

Texture Editing

Interactive Demo

Citation

@article{sun2024real,
  title   = {Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures},
  author  = {Sun, Guoxing and Dabral, Rishabh and Zhu, Heming and Fua, Pascal and Theobalt, Christian and Habermann, Marc},
  year    = {2024},
  journal = {arXiv preprint arXiv:2412.13183}
}

Acknowledgement

We thank Oleksandr Sotnychenko and Pranay Raj Kamuni for their help in data collection and processing, and Kunwar Maheep Singh for his help in segmentation.