VHOI: Controllable Video Generation of Human–Object Interactions from Sparse Trajectories via Motion Densification


CVPR 2026 Findings


1Max Planck Institute for Informatics, Saarland Informatics Campus

2Saarbrücken Research Center for Visual Computing, Interaction and AI


3Google

Abstract

Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability into video generation adds further complexity. Existing controllable video generation approaches face a trade-off: sparse controls such as trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depth maps, or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner.

Sparse Trajectory Representation


Control via Trajectory Densification


Main Video (With Narration)

Method

Motivated by the importance of instance-aware motion cues for realistic HOI synthesis, VHOI consists of (1) a trajectory augmentor that converts sparse trajectories into dense HOI mask sequences as an intermediate motion representation, and (2) a dense control model that generates the video conditioned on the predicted HOI masks.
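To illustrate the HOI-aware motion representation, the sketch below renders an HOI mask frame where each semantic label (background, object, and individual body parts) maps to a distinct RGB color. The palette and label set here are illustrative assumptions, not the paper's actual encoding; the point is that the conditioning signal carries instance- and part-level identity rather than a single binary mask.

```python
import numpy as np

# Hypothetical label-to-color palette; the paper's actual encoding may differ.
PALETTE = {
    0: (0, 0, 0),        # background
    1: (255, 0, 0),      # object
    2: (0, 255, 0),      # torso
    3: (0, 0, 255),      # left arm
    4: (255, 255, 0),    # right arm
    5: (0, 255, 255),    # legs
}

def colorize_hoi_mask(label_map: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer label map to an (H, W, 3) uint8 RGB image."""
    h, w = label_map.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    for label, color in PALETTE.items():
        rgb[label_map == label] = color
    return rgb

# Toy example: a 4x4 frame with an object region and a torso region.
labels = np.zeros((4, 4), dtype=np.int64)
labels[0, :2] = 1   # object
labels[2:, :] = 2   # torso
frame = colorize_hoi_mask(labels)
```

A sequence of such color-coded frames, one per video frame, forms the dense intermediate motion representation consumed by the second stage.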

Trajectory Augmentor

The trajectory augmentor receives sparse trajectories and, optionally, the corresponding visibility maps as inputs. The trajectories are processed by a trajectory extractor and fused with transformer latents and visibility cues in the augmentor fuser, producing a sequence of HOI masks that densifies the sparse control signal and serves as input to the dense control model. Orange modules denote learnable components; blue modules are frozen.
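The data flow above can be sketched as a minimal, hypothetical module: sparse per-frame trajectory points (with visibility flags) are embedded by a small trajectory extractor, pooled, fused with per-frame transformer latents, and decoded into per-frame HOI mask logits. All module names, architectures, and sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TrajectoryAugmentor(nn.Module):
    def __init__(self, d_model: int = 64, mask_hw: int = 16):
        super().__init__()
        self.mask_hw = mask_hw
        # Embed each (x, y, visibility) trajectory point.
        self.traj_extractor = nn.Sequential(
            nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Fuse pooled trajectory features with (frozen) transformer latents.
        self.fuser = nn.Linear(2 * d_model, d_model)
        # Decode fused features into flattened mask logits.
        self.mask_head = nn.Linear(d_model, mask_hw * mask_hw)

    def forward(self, trajs, visibility, latents):
        # trajs: (T, N, 2) sparse points; visibility: (T, N); latents: (T, d_model)
        pts = torch.cat([trajs, visibility.unsqueeze(-1)], dim=-1)
        feat = self.traj_extractor(pts).mean(dim=1)            # pool over the N points
        fused = self.fuser(torch.cat([feat, latents], dim=-1))
        logits = self.mask_head(fused)                         # (T, H*W)
        return logits.view(-1, self.mask_hw, self.mask_hw)

# Toy forward pass: 8 frames, 4 sparse points per frame.
T, N, d = 8, 4, 64
model = TrajectoryAugmentor(d_model=d)
masks = model(torch.randn(T, N, 2), torch.ones(T, N), torch.randn(T, d))
```

In the actual system the transformer latents come from a frozen backbone and only the extractor, fuser, and mask head (the "orange" modules) would be trained.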

Dense Model

The dense control model conditions on HOI masks. The masks are encoded by an HOI extractor and fused with transformer latents in the dense control fuser, which also includes a confidence prediction head to modulate reliance on the control signal. The final output is an HOI video that follows the densified motion cues. Orange modules denote learnable components; blue modules are frozen.
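One simple way to realize confidence-modulated fusion is to predict a per-token confidence in (0, 1) from the encoded mask features and use it to scale how strongly the control signal perturbs the diffusion latents. The sketch below shows this gating idea; the names, shapes, and residual formulation are assumptions for illustration, not the paper's exact fuser.

```python
import torch
import torch.nn as nn

class DenseControlFuser(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)       # project control features
        self.conf_head = nn.Sequential(               # confidence prediction head
            nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, latents, control_feat):
        # latents, control_feat: (T, L, d_model) per-frame token sequences.
        conf = self.conf_head(control_feat)           # (T, L, 1), values in (0, 1)
        # Residual injection scaled by predicted confidence: where confidence
        # is low, the output falls back toward the unconditioned latents.
        fused = latents + conf * self.proj(control_feat)
        return fused, conf

# Toy forward pass: 8 frames, 10 tokens per frame.
T, L, d = 8, 10, 64
fuser = DenseControlFuser(d_model=d)
out, conf = fuser(torch.randn(T, L, d), torch.randn(T, L, d))
```

Such gating lets the model down-weight predicted HOI masks that are unreliable, e.g. where the first-stage densification is uncertain.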

Application: User-defined Trajectories

Sample 1

Sparse trajectory input

Input: Sparse Trajectories + Image

Trajectory Densification

Generated Sample

Sample 2

Sparse trajectory input

Input: Sparse Trajectories + Image

Trajectory Densification

Generated Sample

Citation

@article{zhang2025vhoi,
  title         = {VHOI: Controllable Video Generation of Human–Object Interactions from Sparse Trajectories via Motion Densification},
  author        = {Zhang, Wanyue and Foo, Lin Geng and Dabral, Rishabh and Beeler, Thabo and Theobalt, Christian},
  year          = {2025},
  archivePrefix = {arXiv},
  eprint        = {2512.09646},
}