Overview

Figure 1. Given skeletal poses and a virtual camera, UMA renders ultra-detailed clothed human appearance and synthesizes high-fidelity geometry. Notably, UMA lets users zoom in digitally to inspect texture details closely, down to fine yarn-level patterns. Additionally, we introduce a new dataset of multi-view 6K video recordings that captures subjects wearing clothing with challenging texture patterns and rich dynamics. The fidelity of the reconstructed avatars makes them particularly suitable for virtual and mixed reality, where users can closely observe fine-grained appearance details.

Pipeline

Figure 2. UMA takes skeletal motion and the camera view as input and generates high-fidelity geometry and appearance. For avatar representation, to address the stochasticity of the clothing dynamics that cannot be modeled by the skeletal motions, we inject a learnable latent code $\mathbf{z}_f$ (zero latent $\mathbf{z}_{0}$ for testing) into the drivable template $\mathbf{V}_f$. A texel super-resolution module $\mathcal{E}_\mathrm{sr}$ is adopted to densify the animatable Gaussian textures. For multi-level surface alignment, we supervise the surface geometry at both the vertex and texel levels using novel supervision derived from a foundational 2D point tracker. Specifically, the 2D point tracks $\mathbf{P}_{f,c,i}$ between the rasterized and ground-truth images obtained from the tracker are lifted and aggregated into 3D correspondences $\tilde{\mathbf{P}}_{f,i}$ across multiple views using the drivable template $\mathbf{V}_f$.
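
As a rough illustration of how per-view 2D tracks can be aggregated into one 3D correspondence, the sketch below uses standard linear (DLT) triangulation over calibrated views. This is a generic, swapped-in technique: the caption does not specify how UMA lifts the tracks with the drivable template, so the lifting step here is an assumption, and the names `project`, `triangulate_dlt`, `proj_mats`, and `points_2d` are hypothetical.

```python
import numpy as np


def project(proj_mat, point_3d):
    """Project a 3D point with a 3x4 camera matrix and return pixel (u, v)."""
    x = proj_mat @ np.append(point_3d, 1.0)
    return x[:2] / x[2]


def triangulate_dlt(proj_mats, points_2d):
    """Aggregate one 2D observation per view into a single 3D point.

    Assumption: generic linear triangulation, not necessarily UMA's lifting.
    proj_mats: iterable of (3, 4) camera projection matrices.
    points_2d: iterable of (u, v) observations, one per camera.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares solution is the right singular vector that belongs
    # to the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With noisy tracks, a per-view confidence weight on each pair of rows would be a natural extension; whether UMA weights views this way is not stated here.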

Dataset

Table 1. Statistics of our high-quality dataset, which features long 6K multi-view videos captured in a light stage, with challenging motion-aware wrinkles and fine-grained appearance details. For each subject, we provide a separate testing sequence to evaluate the generalization ability of the model.

Name       Length (Train)  Length (Test)  Cameras  Rigged  Masks  GT Meshes  Pose  Hand  SMPL Params
Subject_0  11638           6060           40
Subject_1  15710           8500           40
Subject_2  15240           8180           40
Subject_3  16220           9880           40
Subject_4  17240           8300           40

Main Video (with Narration)

Free Viewpoint Rendering

Detailed Geometry

UMA generates ultra-detailed geometry that captures fine clothing wrinkles. The meshes share the same triangulation and remain in correspondence over time.
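
One practical consequence of a shared triangulation is that per-vertex quantities can be compared across frames by index alone, with no matching step. The sketch below is a minimal illustration, assuming each frame is available as an `(N, 3)` vertex array in the same vertex order; the function name `vertex_displacements` is hypothetical.

```python
import numpy as np


def vertex_displacements(verts_a, verts_b):
    """Per-vertex displacement length between two frames.

    Assumes both frames use the same triangulation, so row i in each
    array refers to the same surface point.
    """
    if verts_a.shape != verts_b.shape:
        raise ValueError("frames must share the same vertex layout")
    return np.linalg.norm(verts_b - verts_a, axis=1)
```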

Motion Retargeting

UMA can retarget the motion of a source character (first column) to target characters (second and third columns) while preserving photo-real clothing wrinkles and plausible dynamics. Please enable fullscreen mode for a better viewing experience.

Texture Editing

Thanks to the texel-aligned consistent geometry, UMA enables consistent texture editing. Notably, the inserted texture deforms seamlessly with the clothing wrinkles and remains consistently anchored to the characters' original texture.
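
A minimal sketch of such an edit, assuming the appearance is stored in a UV texture map (an assumption about the representation, not a description of UMA's actual editing tool): a patch is alpha-blended into the texture once, and because every frame samples the same texels, the edit follows the surface. The function name `paste_patch` and its parameters are hypothetical.

```python
import numpy as np


def paste_patch(texture, patch, alpha, top, left):
    """Alpha-blend a patch into a copy of a UV texture.

    texture: (H, W, 3) array, patch: (h, w, 3) array,
    alpha: (h, w) array in [0, 1]; (top, left) is the target texel.
    The patch is assumed to fit inside the texture.
    """
    h, w = patch.shape[:2]
    out = texture.astype(np.float32)  # astype returns a new array
    a = alpha[..., None].astype(np.float32)
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = a * patch + (1.0 - a) * region
    return out
```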

Citation

@article{zhu2025ultra,
  title={UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment},
  author={Zhu, Heming and Sun, Guoxing and Theobalt, Christian and Habermann, Marc},
  journal={arXiv preprint arXiv:2506.01802},
  year={2025}
}