GIGA: Generalizable Sparse Image-driven Gaussian Humans

3DV 2026

1Max Planck Institute for Informatics, Saarland Informatics Campus

2Saarbrücken Research Center for Visual Computing, Interaction and AI

3Google, Switzerland

Demo

Dynamic GIGA outputs (left) are driven with 4 input views (right), with appearance changes propagated from the input images.

What’s inside

GIGA pipeline.

GIGA creates texel-aligned 3D Gaussians from sparse (1-4) input views $I_k$ and a body template $(\boldsymbol{\theta}, \boldsymbol{\beta}, \boldsymbol{\psi})$. It computes an RGB texture $\mathbf{T}_{\mathrm{uv}}$ and a canonical position map $\mathbf{T}_{\mathbf{x}_0}$ as character-specific inputs. Separate appearance ($\mathcal{E}_{\mathrm{a}}$) and geometry ($\mathcal{E}_{\mathrm{g}}$) encoders process these inputs. Both encoders use cross-attention to condition on the observed character pose embedding $\mathbf{y}_{\mathrm{m}}$, with the motion embedding serving as context and the combined encoder outputs ($\mathbf{F}^{\mathrm{a}}_{\mathrm{uv}}, \mathbf{F}^{\mathrm{g}}_{\mathrm{uv}}$) as query. Multiple decoders ($\mathcal{D}_{\mathrm{a}}, \mathcal{D}_{\mathrm{p}}, \mathcal{D}_{\mathrm{g}}$) generate the final texel-aligned 3D Gaussian avatar, taking into account intermediate feature maps from the encoders, propagated through skip connections (colored dashed lines). The final representation is articulated with linear blend skinning.
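For readers who prefer code, the following is a minimal PyTorch sketch of this encoder/decoder layout. It is not the authors' implementation: all module names, channel widths, and the attribute split across the decoder heads are illustrative assumptions, and the U-Net skip connections are omitted for brevity.

import torch
import torch.nn as nn

class CrossAttnBlock(nn.Module):
    """Cross-attention conditioning: texel features are the query, the motion embedding y_m is the context."""
    def __init__(self, dim, ctx_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_uv, y_m):
        # feat_uv: (B, C, H, W) texel-aligned features; y_m: (B, T, ctx_dim) motion embedding
        B, C, H, W = feat_uv.shape
        q = feat_uv.flatten(2).transpose(1, 2)      # (B, H*W, C) texel queries
        out, _ = self.attn(self.norm(q), y_m, y_m)  # attend texels to the motion context
        return (q + out).transpose(1, 2).reshape(B, C, H, W)

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.GELU())

class GIGASketch(nn.Module):
    def __init__(self, dim=64, ctx_dim=128):
        super().__init__()
        self.enc_a = conv_block(3, dim)        # E_a: appearance encoder on the RGB texture T_uv
        self.enc_g = conv_block(3, dim)        # E_g: geometry encoder on canonical positions T_x0
        self.xattn = CrossAttnBlock(2 * dim, ctx_dim)
        # One decoder head per Gaussian attribute group (this split is an assumption):
        self.dec_a = nn.Conv2d(2 * dim, 3, 1)  # D_a: per-texel color
        self.dec_p = nn.Conv2d(2 * dim, 3, 1)  # D_p: per-texel canonical position offset
        self.dec_g = nn.Conv2d(2 * dim, 8, 1)  # D_g: scale (3) + rotation quaternion (4) + opacity (1)

    def forward(self, T_uv, T_x0, y_m):
        F_uv = torch.cat([self.enc_a(T_uv), self.enc_g(T_x0)], dim=1)  # (F_uv^a, F_uv^g)
        F_uv = self.xattn(F_uv, y_m)                                   # condition on the pose embedding
        return self.dec_a(F_uv), self.dec_p(F_uv), self.dec_g(F_uv)

Each valid texel of the three output maps then parameterizes one 3D Gaussian anchored on the body mesh.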

Abstract

Driving a high-quality, photorealistic full-body virtual human from a few RGB cameras is a challenging problem that has become increasingly relevant with emerging virtual reality technologies. A promising solution to democratize such technology would be a generalizable method that takes sparse multi-view images of any person and generates photoreal free-view renderings of them. However, state-of-the-art approaches do not scale to very large datasets and thus lack diversity and photorealism. To address this problem, we propose GIGA, a novel, generalizable full-body model for rendering photoreal humans in free viewpoint, driven by a single-view or sparse multi-view video. Notably, GIGA can scale training to a few thousand subjects while maintaining high photorealism and synthesizing dynamic appearance. At its core, we introduce a MultiHeadUNet architecture, which takes an approximate RGB texture accumulated from a single or multiple sparse views and predicts 3D Gaussian primitives represented as 2D texels on top of a human body mesh. At test time, our method performs novel view synthesis of a virtual 3D Gaussian-based human from 1 to 4 input views and a tracked body template for unseen identities. Our method outperforms prior work by a significant margin in identity generalization and photorealism.
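As a concrete illustration of the final articulation step, here is a hedged sketch of posing the predicted canonical Gaussian centers with linear blend skinning. The skinning weights and per-joint transforms are assumed to come from the tracked body template, and all variable names are illustrative.

import torch

def lbs_articulate(x_canonical, skin_weights, joint_transforms):
    # x_canonical:      (N, 3)    canonical Gaussian centers predicted per texel
    # skin_weights:     (N, J)    per-Gaussian skinning weights (rows sum to 1)
    # joint_transforms: (J, 4, 4) rigid transforms of the J template joints
    # returns:          (N, 3)    posed Gaussian centers
    T = torch.einsum('nj,jab->nab', skin_weights, joint_transforms)   # blend one 4x4 per Gaussian
    x_h = torch.cat([x_canonical, torch.ones_like(x_canonical[:, :1])], dim=1)  # homogeneous coords
    return torch.einsum('nab,nb->na', T, x_h)[:, :3]

The Gaussian orientations can be posed analogously by applying the rotational part of each blended transform T.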

MVHumanNet
DNA-Rendering
Each novel identity is produced by GIGA at test time with 4 input views and no additional training.

BibTeX citation

@article{zubekhin2025giga,
  title={GIGA: Generalizable Sparse Image-driven Gaussian Humans},
  author={Zubekhin, Anton and Zhu, Heming and Gotardo, Paulo and Beeler, Thabo and Habermann, Marc and Theobalt, Christian},
  journal={arXiv preprint arXiv:2504.07144},
  year={2025},
}