The emergence of neural rendering has significantly advanced the rendering quality of 3D human avatars, and the recently popular 3DGS technique enables real-time performance. However, SMPL-driven 3DGS human avatars still struggle to capture fine appearance details due to the complex pose-to-appearance mapping learned during fitting. In this paper, we propose SeqAvatar, which exploits the explicit 3DGS representation to better model human avatars with a hierarchical motion context. Specifically, we utilize coarse-to-fine motion conditions that incorporate both the overall human skeleton and fine-grained vertex motions for non-rigid deformation. To enhance the robustness of these motion conditions, we adopt a spatio-temporal multi-scale sampling strategy that hierarchically integrates additional motion cues. Extensive experiments demonstrate that our method significantly outperforms 3DGS-based approaches and renders human avatars orders of magnitude faster than the latest NeRF-based models that incorporate temporal context, while delivering comparable or even superior quality.
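To make the hierarchical motion condition concrete, below is a minimal PyTorch sketch of how a coarse skeleton code and a fine-grained per-Gaussian vertex code might be gathered over spatio-temporal multi-scale windows. The function name, tensor layouts, and the `scales`/`window` parameters are our own illustrative assumptions, not the paper's exact interface.

```python
import torch

def build_motion_condition(joint_rots, vertex_motions, gauss_to_vertex,
                           t, scales=(1, 2, 4), window=4):
    """Hypothetical sketch: stack a coarse skeleton condition with a
    fine-grained per-Gaussian vertex condition over multi-scale
    temporal windows (coarser scales reach further into the past).

    joint_rots:      (T, J, 6)  per-frame joint rotations (e.g. 6D repr.)
    vertex_motions:  (T, V, 3)  per-frame template-vertex displacements
    gauss_to_vertex: (N,)       nearest template vertex for each Gaussian
    """
    coarse, fine = [], []
    for s in scales:
        # sample `window` past frames at stride s, clamped to the sequence start
        idx = torch.clamp(torch.arange(t - (window - 1) * s, t + 1, s), min=0)
        coarse.append(joint_rots[idx].flatten(1))             # (window, J*6)
        fine.append(vertex_motions[idx][:, gauss_to_vertex])  # (window, N, 3)
    coarse = torch.cat(coarse, dim=0).flatten()               # shared skeleton code
    fine = torch.cat(fine, dim=0).permute(1, 0, 2).flatten(1) # (N, window*S*3)
    # broadcast the shared skeleton code to every Gaussian and concatenate
    return torch.cat([coarse.expand(fine.shape[0], -1), fine], dim=-1)
```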
Overview of the proposed method. We first initialize canonical Gaussian positions with the SMPL template vertices. For each Gaussian, we derive both a coarse skeleton motion condition and a fine-grained vertex motion condition sampled from the vertex motion template (points in different colors represent different motions). Conditioned on this hierarchical motion information, an MLP predicts each Gaussian's non-rigid deformation. The deformed Gaussians are then warped into observation space via the standard LBS transformation for rendering.
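The deformation-then-warping step in the overview can be sketched as follows: an MLP maps each canonical Gaussian center and its motion condition to a position offset, and standard linear blend skinning (LBS) carries the result into observation space. The layer sizes, `cond` layout, and class name are assumptions for illustration; the actual network may also predict offsets for other Gaussian attributes.

```python
import torch
import torch.nn as nn

class NonRigidDeformer(nn.Module):
    """Illustrative sketch of non-rigid deformation followed by LBS warping."""

    def __init__(self, cond_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))  # predicts a per-Gaussian position offset

    def forward(self, xyz_canon, cond, skin_weights, bone_transforms):
        # xyz_canon:       (N, 3)    canonical Gaussian centers
        # cond:            (N, C)    hierarchical motion condition
        # skin_weights:    (N, J)    LBS weights taken from nearby SMPL vertices
        # bone_transforms: (J, 4, 4) per-joint rigid transforms for this frame
        offset = self.mlp(torch.cat([xyz_canon, cond], dim=-1))
        xyz = xyz_canon + offset  # non-rigid deformation in canonical space
        # blend per-joint transforms, then warp homogeneous points (standard LBS)
        T = torch.einsum('nj,jab->nab', skin_weights, bone_transforms)  # (N,4,4)
        xyz_h = torch.cat([xyz, torch.ones_like(xyz[:, :1])], dim=-1)   # (N,4)
        return torch.einsum('nab,nb->na', T, xyz_h)[:, :3]  # observation space
```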
@inproceedings{xu2025seqavatar,
title={Sequential Gaussian Avatars with Hierarchical Motion Context},
author={Xu, Wangze and Zhan, Yifan and Zhong, Zhihang and Sun, Xiao},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025},
}