UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment

Overview

Figure 1: Given skeletal poses and a virtual camera, UMA renders ultra-detailed clothed human appearance and synthesizes high-fidelity geometry. Notably, UMA enables users to digitally zoom in, allowing close inspection of texture details or even fine yarn-level patterns. Additionally, we introduce a new dataset featuring multi-view 6K video recordings, capturing subjects wearing clothing with challenging texture patterns and rich dynamics. The fidelity of the reconstructed avatars makes them particularly suitable for virtual and mixed reality, where users can closely observe fine-grained appearance details.

Pipeline

Figure 2. UMA takes skeletal motion and the camera view as input and generates high-fidelity geometry and appearance. For avatar representation, to address the stochasticity of the clothing dynamics that cannot be modeled by the skeletal motions, we inject a learnable latent code $\mathbf{z}_f$ (zero latent $\mathbf{z}_{0}$ for testing) into the drivable template $\mathbf{V}_f$. A texel super-resolution module $\mathcal{E}_\mathrm{sr}$ is adopted to densify the animatable Gaussian textures. For multi-level surface alignment, we supervise the surface geometry at both the vertex and texel levels using novel supervision derived from a foundational 2D point tracker. Specifically, the 2D point tracks $\mathbf{P}_{f,c,i}$ between the rasterized and ground-truth images obtained from the tracker are lifted and aggregated into 3D correspondences $\tilde{\mathbf{P}}_{f,i}$ across multiple views using the drivable template $\mathbf{V}_f$.
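The lifting and multi-view aggregation step described in the caption can be pictured with a small NumPy sketch. This is not the authors' implementation: it assumes a pinhole camera with intrinsics `K` and extrinsics `R`, `t` (with `x_cam = R @ x_world + t`), assumes a depth value is available for every tracked pixel (for example from the rasterized drivable template), and assumes a simple visibility-weighted mean as the aggregation rule. The function names `unproject` and `aggregate_tracks` are hypothetical.

```python
import numpy as np


def unproject(pixels, depths, K, R, t):
    """Back-project 2D points (N, 2) with depths (N,) into world space.

    Assumes a pinhole model where x_cam = R @ x_world + t.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)   # (N, 3) homogeneous pixels
    rays = homog @ np.linalg.inv(K).T                # camera-space rays with z = 1
    x_cam = rays * depths[:, None]                   # scale rays by depth
    return (x_cam - t) @ R                           # row-wise R^T (x_cam - t)


def aggregate_tracks(lifted, visible):
    """Fuse per-view lifted points into one 3D correspondence per point.

    lifted:  (C, N, 3) world-space points, one set per camera.
    visible: (C, N) boolean mask; invisible entries are ignored.
    Returns the (N, 3) visibility-weighted mean and an (N,) validity mask.
    """
    masked = np.where(visible[..., None], lifted, 0.0)
    count = visible.sum(axis=0).astype(np.float64)   # views that see each point
    mean = masked.sum(axis=0) / np.maximum(count, 1.0)[:, None]
    return mean, count > 0
```

Points that no camera sees are flagged as invalid, so a supervision loss built on such correspondences could simply skip them.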

Dataset

Table 1: Statistics of our high-quality dataset, which features long 6K multi-view videos captured in a light stage, with challenging motion-aware wrinkles and fine-grained appearance details. For each subject, we provide a separate testing sequence to validate the generalization ability of the model.

Name	Length (Train)	Length (Test)	Cameras	Rigged	Masks	GT Meshes	Pose	Hand	SMPL Params
Subject_0	11638	6060	40	✓	✓	✓	✓	✓	✓
Subject_1	15710	8500	40	✓	✓	✓	✓	✓	✓
Subject_2	15240	8180	40	✓	✓	✓	✓	✓	✓
Subject_3	16220	9880	40	✓	✓	✓	✓	✓	✓
Subject_4	17240	8300	40	✓	✓	✓	✓	✓	✓
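To get a feel for the size of the splits, the lengths in Table 1 can be tallied with a few lines of Python. The numbers are copied from the table; the table does not state the unit of the length columns, so the sketch treats them as plain counts. The dictionary name `LENGTHS` and the function `split_totals` are illustrative only.

```python
# (train length, test length) per subject, copied from Table 1.
LENGTHS = {
    "Subject_0": (11638, 6060),
    "Subject_1": (15710, 8500),
    "Subject_2": (15240, 8180),
    "Subject_3": (16220, 9880),
    "Subject_4": (17240, 8300),
}


def split_totals(lengths):
    """Return the summed train and test lengths over all subjects."""
    train = sum(tr for tr, _ in lengths.values())
    test = sum(te for _, te in lengths.values())
    return train, test


train_total, test_total = split_totals(LENGTHS)
test_share = test_total / (train_total + test_total)
print(f"train={train_total} test={test_total} test share={test_share:.1%}")
```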

Motion Retargeting

UMA can retarget motion from the source character (first column) to the target characters (second and third columns) while preserving photo-real clothing wrinkles and plausible dynamics.

Texture Editing

Thanks to the texel-aligned consistent geometry, UMA enables consistent texture editing. Notably, the inserted texture deforms seamlessly with the clothing wrinkles and remains consistently anchored to the characters' original texture.
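One way to picture such an edit is as a paste into texture space: because the surface stays attached to fixed texels, a patch blended into the texture map moves with the cloth. The following is a minimal sketch of that idea only, assuming the appearance can be treated as a plain RGB texture array and that the patch fits inside it; it does not reflect how UMA stores or edits its Gaussian textures, and `paste_into_texture` is a hypothetical helper.

```python
import numpy as np


def paste_into_texture(texture, patch, alpha, top, left):
    """Alpha-blend a patch into a copy of a texture map.

    texture: (H, W, 3) float array, the original texel colors.
    patch:   (h, w, 3) float array, the inserted texture.
    alpha:   (h, w) float array in [0, 1], per-texel opacity of the patch.
    top, left: texel coordinates of the patch's upper-left corner.
    Assumes the patch lies fully inside the texture.
    """
    out = texture.copy()
    h, w = patch.shape[:2]
    region = out[top:top + h, left:left + w]
    a = alpha[..., None]
    out[top:top + h, left:left + w] = a * patch + (1.0 - a) * region
    return out
```

Texels outside the pasted region are left unchanged, which mirrors the claim that the edit stays anchored to the original texture.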

@article{zhu2025ultra,
  title={UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment},
  author={Zhu, Heming and Sun, Guoxing and Theobalt, Christian and Habermann, Marc},
  journal={arXiv preprint arXiv:2506.01802},
  year={2025}
}