We propose Double Unprojected Textures (DUT), a new method to synthesize photorealistic 4K novel-view renderings in real-time. Our method consistently outperforms baseline approaches in rendering quality and inference speed. Moreover, it generalizes to both in-distribution (IND) motions, e.g., dancing, and out-of-distribution (OOD) motions, e.g., standing long jump.
Real-time free-view human rendering from sparse-view RGB inputs is a challenging task due to sensor scarcity and the tight time budget. To ensure efficiency, recent methods leverage 2D CNNs operating in texture space to learn rendering primitives. However, they either jointly learn geometry and appearance, or completely ignore sparse image information for geometry estimation, significantly harming visual quality and robustness to unseen body poses. To address these issues, we present Double Unprojected Textures, which, at its core, disentangles coarse geometric deformation estimation from appearance synthesis, enabling robust and photorealistic 4K rendering in real-time. Specifically, we first introduce a novel image-conditioned template deformation network, which estimates the coarse deformation of the human template from a first unprojected texture. This updated geometry is then used to apply a second and more accurate texture unprojection. The resulting texture map has fewer artifacts and better alignment with the input views, which benefits our learning of finer-level geometry and appearance represented by Gaussian splats. We validate the effectiveness and efficiency of the proposed method in quantitative and qualitative experiments, where it significantly surpasses other state-of-the-art methods.
Given sparse-view images and the corresponding motion, DUT predicts coarse template geometry and fine-grained 3D Gaussians. We first unproject the images onto the posed template to obtain a texture map, which is fed into GeoNet to estimate deformations of the template in canonical pose. We then unproject the images again onto the posed and deformed template to obtain a less-distorted texture map, which serves as input to our GauNet for estimating 3D Gaussian parameters; the predicted Gaussians undergo scale refinement before splatting.
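To make the two-stage flow concrete, here is a minimal PyTorch-style sketch of the inference path described above. All module and helper names (the GeoNet/GauNet interfaces, unproject_to_texture, lbs_pose, refine_scales, splat_gaussians) are placeholders we introduce for illustration and do not correspond to a released implementation; a sketch of the unprojection helper follows further below.

```python
import torch

# Hypothetical helpers, named for illustration only:
#   unproject_to_texture: samples the input views onto the template's UV map
#   lbs_pose:             poses the (possibly deformed) canonical template via skinning
#   refine_scales:        post-hoc refinement of the predicted Gaussian scales
#   splat_gaussians:      rasterizes 3D Gaussians into a target camera view

@torch.no_grad()
def dut_inference(images, cameras, motion, template, geo_net, gau_net, target_cam):
    # 1) Pose the undeformed canonical template with the tracked motion.
    posed_template = lbs_pose(template, motion)

    # 2) First unprojection: texture map from the undeformed, posed template.
    tex_coarse = unproject_to_texture(images, cameras, posed_template)

    # 3) GeoNet: estimate per-vertex deformation of the canonical template
    #    from the (distorted) first texture map.
    deformation = geo_net(tex_coarse, motion)
    deformed_template = template + deformation

    # 4) Pose the deformed template and unproject a second, better-aligned texture.
    posed_deformed = lbs_pose(deformed_template, motion)
    tex_fine = unproject_to_texture(images, cameras, posed_deformed)

    # 5) GauNet: predict per-texel 3D Gaussian parameters, then refine scales.
    gaussians = gau_net(tex_fine, posed_deformed)
    gaussians = refine_scales(gaussians)

    # 6) Splat the Gaussians into the requested novel view.
    return splat_gaussians(gaussians, target_cam)
```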
Under the same body pose, the degree of deformation is reflected in the distortions of the undeformed (first) texture map, which offers additional information to resolve the one-to-many mapping issue in motion-driven deformation methods.
Performing a second texture unprojection using the deformed template leads to fewer ghosting artifacts and better geometric alignment.
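Below is a simplified, hedged sketch of what such a texture unprojection step can look like, assuming per-texel 3D positions and normals on the (posed) template and standard pinhole projection matrices. It uses a normal-based visibility proxy and omits the occlusion handling and blending details a full implementation would need; the signature is our own illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def unproject_to_texture(images, proj, texel_pos, texel_nrm, cam_centers):
    """Sketch of multi-view texture unprojection (simplified, occlusion-unaware).

    images:      (V, 3, H, W)  input views
    proj:        (V, 3, 4)     camera projection matrices (world -> pixel)
    texel_pos:   (T, 3)        3D position of each texel on the (posed) template
    texel_nrm:   (T, 3)        texel normals in world space
    cam_centers: (V, 3)        camera centers in world space
    returns:     (T, 3)        blended texel colors
    """
    V, _, H, W = images.shape

    # Project every texel into every view: (V, T, 3) homogeneous pixel coords.
    pos_h = torch.cat([texel_pos, torch.ones_like(texel_pos[:, :1])], dim=-1)
    pix = torch.einsum('vij,tj->vti', proj, pos_h)
    uv = pix[..., :2] / pix[..., 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and sample per-view colors at the projected texels.
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                        uv[..., 1] / (H - 1) * 2 - 1], dim=-1)        # (V, T, 2)
    colors = F.grid_sample(images, grid.unsqueeze(2), align_corners=True)
    colors = colors.squeeze(-1).permute(0, 2, 1)                      # (V, T, 3)

    # Visibility proxy: cosine between texel normal and direction to the camera.
    view_dir = F.normalize(cam_centers[:, None, :] - texel_pos[None], dim=-1)
    weight = (texel_nrm[None] * view_dir).sum(-1).clamp(min=0.0)      # (V, T)
    weight = weight / weight.sum(0, keepdim=True).clamp(min=1e-6)

    # View-weighted blend of the sampled colors.
    return (colors * weight.unsqueeze(-1)).sum(0)                     # (T, 3)
```

With the deformed template, the projected texel positions land closer to the true surface in each view, so the per-view samples agree better and the blended texture shows fewer ghosting artifacts.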
The runtime consists of model inference and novel-view rendering.
[1] Drivable Volumetric Avatars using Texel-Aligned Features (Remelli et al. 2022)
[2] Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras (Shetty et al. 2024)
Our method supports texture editing of garments in texel space.
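Since appearance is parameterized in texel space, a garment edit amounts to compositing on the unprojected texture map before it is passed to the appearance network. A minimal sketch under that assumption, with hypothetical inputs (a UV-space garment mask and a user-provided edit texture):

```python
import torch

def edit_garment_texture(unprojected_tex, edit_tex, garment_mask):
    """Composite an edited garment texture into the texel-space texture.

    unprojected_tex: (3, H, W) texture map from the second unprojection
    edit_tex:        (3, H, W) user-provided texture (e.g., a logo or pattern)
    garment_mask:    (1, H, W) UV-space mask selecting the garment's texels
    """
    return unprojected_tex * (1 - garment_mask) + edit_tex * garment_mask
```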
Real-time, end-to-end, and interactively controllable. Note that all operations are executed online after the raw data inputs are received. Multi-view RGB streams/videos with paired motions ⇒ Dynamic textures ⇒ 3D Gaussians ⇒ Free-viewpoint renderings.
@inproceedings{sun2025real,
  title     = {Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures},
  author    = {Sun, Guoxing and Dabral, Rishabh and Zhu, Heming and Fua, Pascal and Theobalt, Christian and Habermann, Marc},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
We thank Oleksandr Sotnychenko and Pranay Raj Kamuni for their help in data collection and processing, and Kunwar Maheep Singh for his help in segmentation.