GIGA: Generalizable Sparse Image-driven Gaussian Humans

3DV 2026

1Max Planck Institute for Informatics, Saarland Informatics Campus

2Saarbrücken Research Center for Visual Computing, Interaction and AI

3Google, Switzerland

Demo

Dynamic GIGA outputs (left) are driven with 4 input views (right), with appearance changes propagated from the input images.

What’s inside

GIGA pipeline.

GIGA creates texel-aligned 3D Gaussians from sparse (1-4) input views $I_k$ and a body template $(\boldsymbol{\theta}, \boldsymbol{\beta}, \boldsymbol{\psi})$. It computes an RGB texture $\mathbf{T}_{\mathrm{uv}}$ and a canonical position map $\mathbf{T}_{\mathbf{x}_0}$ as character-specific inputs. Separate appearance ($\mathcal{E}_{\mathrm{a}}$) and geometry ($\mathcal{E}_{\mathrm{g}}$) encoders process these inputs. Both encoders use cross-attention to condition on the observed character pose embedding $\mathbf{y}_{\mathrm{m}}$, with the motion embedding serving as context and the combined encoder outputs ($\mathbf{F}^{\mathrm{a}}_{\mathrm{uv}}, \mathbf{F}^{\mathrm{g}}_{\mathrm{uv}}$) as query. Multiple decoders ($\mathcal{D}_{\mathrm{a}}, \mathcal{D}_{\mathrm{p}}, \mathcal{D}_{\mathrm{g}}$) generate the final texel-aligned 3D Gaussian avatar, taking into account intermediate feature maps from the encoders, propagated through skip connections (colored dashed lines). The final representation is articulated with linear blend skinning.
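For readers who prefer code, the following is a minimal PyTorch sketch of this encoder/decoder layout. It is not the authors' implementation: all module names, channel widths, and the attribute split across the decoder heads are illustrative assumptions, and the U-Net skip connections are omitted for brevity.

import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """Cross-attention conditioning: texel features are the query, the motion embedding y_m is the context."""
    def __init__(self, dim, ctx_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_uv, y_m):
        # feat_uv: (B, C, H, W) texel-aligned features; y_m: (B, T, ctx_dim) motion embedding
        B, C, H, W = feat_uv.shape
        q = feat_uv.flatten(2).transpose(1, 2)      # (B, H*W, C) texel queries
        out, _ = self.attn(self.norm(q), y_m, y_m)  # attend texels to the motion context
        return (q + out).transpose(1, 2).reshape(B, C, H, W)

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.GELU())

class GIGASketch(nn.Module):
    def __init__(self, dim=64, ctx_dim=128):
        super().__init__()
        self.enc_a = conv_block(3, dim)        # E_a: appearance encoder on the RGB texture T_uv
        self.enc_g = conv_block(3, dim)        # E_g: geometry encoder on canonical positions T_x0
        self.xattn = CrossAttnBlock(2 * dim, ctx_dim)
        # One decoder head per Gaussian attribute group (this split is an assumption):
        self.dec_a = nn.Conv2d(2 * dim, 3, 1)  # D_a: per-texel color
        self.dec_p = nn.Conv2d(2 * dim, 3, 1)  # D_p: per-texel canonical position offset
        self.dec_g = nn.Conv2d(2 * dim, 8, 1)  # D_g: scale (3) + rotation quaternion (4) + opacity (1)

    def forward(self, T_uv, T_x0, y_m):
        F_uv = torch.cat([self.enc_a(T_uv), self.enc_g(T_x0)], dim=1)  # (F_uv^a, F_uv^g)
        F_uv = self.xattn(F_uv, y_m)                                   # condition on the pose embedding
        return self.dec_a(F_uv), self.dec_p(F_uv), self.dec_g(F_uv)

Each valid texel of the three output maps then parameterizes one 3D Gaussian anchored on the body mesh.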

Abstract

Driving a high-quality, photorealistic full-body virtual human from a few RGB cameras is a challenging problem that has become increasingly relevant with emerging virtual reality technologies. A promising solution to democratize such technology would be a generalizable method that takes sparse multi-view images of any person and generates photoreal free-view renderings of them. However, state-of-the-art approaches do not scale to very large datasets and thus lack diversity and photorealism. To address this problem, we propose GIGA, a novel, generalizable full-body model for rendering photoreal humans in free viewpoint, driven by a single-view or sparse multi-view video. Notably, GIGA can scale training to a few thousand subjects while maintaining high photorealism and synthesizing dynamic appearance. At its core, we introduce a MultiHeadUNet architecture, which takes an approximate RGB texture accumulated from a single or multiple sparse views and predicts 3D Gaussian primitives represented as 2D texels on top of a human body mesh. At test time, our method performs novel view synthesis of a virtual 3D Gaussian-based human from 1 to 4 input views and a tracked body template for unseen identities. Our method outperforms prior work by a significant margin in identity generalization and photorealism.
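As a concrete illustration of the final articulation step, here is a hedged sketch of posing the predicted canonical Gaussian centers with linear blend skinning. The skinning weights and per-joint transforms are assumed to come from the tracked body template, and all variable names are illustrative.

import torch

def lbs_articulate(x_canonical, skin_weights, joint_transforms):
    # x_canonical:      (N, 3)    canonical Gaussian centers predicted per texel
    # skin_weights:     (N, J)    per-Gaussian skinning weights (rows sum to 1)
    # joint_transforms: (J, 4, 4) rigid transforms of the J template joints
    # returns:          (N, 3)    posed Gaussian centers
    T = torch.einsum('nj,jab->nab', skin_weights, joint_transforms)   # blend one 4x4 per Gaussian
    x_h = torch.cat([x_canonical, torch.ones_like(x_canonical[:, :1])], dim=1)  # homogeneous coords
    return torch.einsum('nab,nb->na', T, x_h)[:, :3]

The Gaussian orientations can be posed analogously by applying the rotational part of each blended transform T.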

MVHumanNet
DNA-Rendering
Each novel identity is produced by GIGA at test time with 4 input views and no additional training.

BibTeX citation

@article{zubekhin2025giga,
  title={GIGA: Generalizable Sparse Image-driven Gaussian Humans},
  author={Zubekhin, Anton and Zhu, Heming and Gotardo, Paulo and Beeler, Thabo and Habermann, Marc and Theobalt, Christian},
  journal={arXiv preprint arXiv:2504.07144},
  year={2025},
}