MUA: Mobile Ultra-detailed Animatable Avatars


Heming Zhu1  Guoxing Sun1  Marc Habermann1,2,‡

1Max Planck Institute for Informatics, Saarland Informatics Campus

‡Corresponding author.

Overview

Figure 1: Given skeletal poses and a virtual camera as inputs, MUA produces photorealistic renderings and detailed geometry of animatable clothed humans. By distilling the ultra-high-quality teacher avatar model, i.e., UMA, into a compact student representation, MUA preserves large-scale clothing dynamics together with fine geometric and appearance details, while reducing computation by three orders of magnitude and achieving over 180 FPS on a personal computer. Moreover, MUA enables real-time on-device inference at 24 FPS on a standalone Meta Quest 3 headset, advancing the practical deployment of highly detailed animatable avatars on VR headsets and other computation-constrained mobile platforms.

On Device Demo

All computation is performed entirely on-device on a standalone Meta Quest 3 headset, with no streaming or offloading to an external server or PC. MUA runs in real time at 24 FPS, demonstrating the feasibility of deploying highly detailed animatable avatars directly on mobile VR hardware. The background soundtrack was composed by Ian Taylor :))).

PC Streaming Demo

In this setting, the computation is performed on a desktop PC equipped with a single NVIDIA RTX 3090, and the rendered results are streamed to the Meta Quest 3 headset for display. This configuration leverages the PC's higher compute capacity, allowing MUA to run at 72 or 90 FPS while the headset serves purely as a display client.

Pipeline

Pipeline Diagram

Figure 2: Given root-normalized skeletal motion $\bar{\boldsymbol{\theta}}_f$ as input, we first train a teacher model that models the coarse geometry with a template mesh $\bar{\mathbf{V}}_{f}$ and fine geometry and appearance with 3D Gaussian splat textures $\mathbf{T}^{\mathrm{gs}}_f$. We further decompose $\mathbf{T}^{\mathrm{gs}}_f$ with a wavelet transform to obtain multi-level supervision for distillation. To derive a compact, mobile-ready representation, we model the coarse geometry $\bar{\mathbf{V}}_{f}$ in a PCA subspace defined in canonical space, with the coefficients predicted by a lightweight MLP $\mathcal{F}_{\mathrm{temp}}$. For high-frequency geometry and appearance, we propose Wavelet-guided Multi-level Factorized Gaussian Textures, which represent the animatable avatar with structured blendshapes under a significantly reduced computational budget.
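To make the two compression ideas in the caption concrete, here is a minimal NumPy sketch, not the authors' implementation: a PCA-subspace decoder for the coarse template (with random stand-ins for the learned basis and the MLP output $\mathcal{F}_{\mathrm{temp}}(\bar{\boldsymbol{\theta}}_f)$), and a single-level 2D Haar split of one texture channel into a low-frequency approximation plus three detail bands, the kind of multi-level decomposition used as distillation supervision. All dimensions (N, K, texture size) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Coarse geometry in a PCA subspace --------------------------------
# The canonical template has N vertices; per-frame templates are
# compressed to K PCA basis vectors, and at run time a lightweight MLP
# maps the pose to the K coefficients. Here the basis and coefficients
# are random placeholders.
N, K = 5000, 32
mean_verts = rng.standard_normal((N, 3))      # mean template (canonical space)
basis = rng.standard_normal((K, N * 3))       # K flattened PCA basis vectors

def decode_template(coeffs):
    """Reconstruct template vertices from K PCA coefficients."""
    return mean_verts + (coeffs @ basis).reshape(N, 3)

coeffs = rng.standard_normal(K)               # stand-in for the MLP output
verts = decode_template(coeffs)               # (N, 3) coarse geometry

# --- One level of a 2D Haar wavelet transform -------------------------
# Splits a texture channel into a low-pass band (ll) and three
# high-frequency detail bands (lh, hl, hh), each at half resolution.
def haar2d(x):
    a = (x[0::2, :] + x[1::2, :]) / 2.0       # row-pair average
    d = (x[0::2, :] - x[1::2, :]) / 2.0       # row-pair detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0      # low-low (approximation)
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0      # low-high
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0      # high-low
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0      # high-high
    return ll, lh, hl, hh

tex = rng.standard_normal((256, 256))         # one Gaussian-texture channel
ll, lh, hl, hh = haar2d(tex)                  # each band is (128, 128)
```

Applying `haar2d` recursively to `ll` yields the coarser levels of the pyramid; supervising the student at each level separately lets the low-pass band carry large-scale structure while the detail bands carry fine geometry and appearance.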

Main Video (with Narration)

Citation

@misc{zhu2026muamobileultradetailedanimatable,
      title={MUA: Mobile Ultra-detailed Animatable Avatars},
      author={Heming Zhu and Guoxing Sun and Marc Habermann},
      year={2026},
      eprint={2604.18583},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.18583},
}