VHOI: Controllable Video Generation of Human–Object Interactions from Sparse Trajectories via Motion Densification


CVPR 2026 Findings


1Max Planck Institute for Informatics, Saarland Informatics Campus

2Saarbrücken Research Center for Visual Computing, Interaction and AI


3Google

Abstract

Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability into video generation adds further complexity. Existing controllable video generation approaches face a trade-off: sparse controls such as trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depth maps, or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner.

Sparse Trajectory Representation


Control via Trajectory Densification


Main Video (With Narration)

Method

Motivated by the importance of instance-aware motion cues for realistic HOI synthesis, VHOI consists of (1) a trajectory augmentor that converts sparse trajectories into dense HOI mask sequences as an intermediate motion representation, and (2) a dense control model that generates the video conditioned on the predicted HOI masks.
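To illustrate the HOI-aware motion representation, the sketch below renders an HOI mask frame where each semantic label (background, object, and individual body parts) maps to a distinct RGB color. The palette and label set here are illustrative assumptions, not the paper's actual encoding; the point is that the conditioning signal carries instance- and part-level identity rather than a single binary mask.

```python
import numpy as np

# Hypothetical label-to-color palette; the paper's actual encoding may differ.
PALETTE = {
    0: (0, 0, 0),        # background
    1: (255, 0, 0),      # object
    2: (0, 255, 0),      # torso
    3: (0, 0, 255),      # left arm
    4: (255, 255, 0),    # right arm
    5: (0, 255, 255),    # legs
}

def colorize_hoi_mask(label_map: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer label map to an (H, W, 3) uint8 RGB image."""
    h, w = label_map.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    for label, color in PALETTE.items():
        rgb[label_map == label] = color
    return rgb

# Toy example: a 4x4 frame with an object region and a torso region.
labels = np.zeros((4, 4), dtype=np.int64)
labels[0, :2] = 1   # object
labels[2:, :] = 2   # torso
frame = colorize_hoi_mask(labels)
```

A sequence of such color-coded frames, one per video frame, forms the dense intermediate motion representation consumed by the second stage.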

Trajectory Augmentor

The trajectory augmentor receives sparse trajectories and, optionally, the corresponding visibility maps as inputs. The trajectories are processed by a trajectory extractor and fused with transformer latents and visibility cues in the augmentor fuser, producing a sequence of HOI masks that densifies the sparse control signal and serves as input to the dense control model. Orange modules denote learnable components; blue modules are frozen.
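The data flow above can be sketched as a minimal, hypothetical module: sparse per-frame trajectory points (with visibility flags) are embedded by a small trajectory extractor, pooled, fused with per-frame transformer latents, and decoded into per-frame HOI mask logits. All module names, architectures, and sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TrajectoryAugmentor(nn.Module):
    def __init__(self, d_model: int = 64, mask_hw: int = 16):
        super().__init__()
        self.mask_hw = mask_hw
        # Embed each (x, y, visibility) trajectory point.
        self.traj_extractor = nn.Sequential(
            nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Fuse pooled trajectory features with (frozen) transformer latents.
        self.fuser = nn.Linear(2 * d_model, d_model)
        # Decode fused features into flattened mask logits.
        self.mask_head = nn.Linear(d_model, mask_hw * mask_hw)

    def forward(self, trajs, visibility, latents):
        # trajs: (T, N, 2) sparse points; visibility: (T, N); latents: (T, d_model)
        pts = torch.cat([trajs, visibility.unsqueeze(-1)], dim=-1)
        feat = self.traj_extractor(pts).mean(dim=1)            # pool over the N points
        fused = self.fuser(torch.cat([feat, latents], dim=-1))
        logits = self.mask_head(fused)                         # (T, H*W)
        return logits.view(-1, self.mask_hw, self.mask_hw)

# Toy forward pass: 8 frames, 4 sparse points per frame.
T, N, d = 8, 4, 64
model = TrajectoryAugmentor(d_model=d)
masks = model(torch.randn(T, N, 2), torch.ones(T, N), torch.randn(T, d))
```

In the actual system the transformer latents come from a frozen backbone and only the extractor, fuser, and mask head (the "orange" modules) would be trained.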

Dense Model

The dense control model conditions on HOI masks. The masks are encoded by an HOI extractor and fused with transformer latents in the dense control fuser, which also includes a confidence prediction head to modulate reliance on the control signal. The final output is an HOI video that follows the densified motion cues. Orange modules denote learnable components; blue modules are frozen.
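One simple way to realize confidence-modulated fusion is to predict a per-token confidence in (0, 1) from the encoded mask features and use it to scale how strongly the control signal perturbs the diffusion latents. The sketch below shows this gating idea; the names, shapes, and residual formulation are assumptions for illustration, not the paper's exact fuser.

```python
import torch
import torch.nn as nn

class DenseControlFuser(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)       # project control features
        self.conf_head = nn.Sequential(               # confidence prediction head
            nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, latents, control_feat):
        # latents, control_feat: (T, L, d_model) per-frame token sequences.
        conf = self.conf_head(control_feat)           # (T, L, 1), values in (0, 1)
        # Residual injection scaled by predicted confidence: where confidence
        # is low, the output falls back toward the unconditioned latents.
        fused = latents + conf * self.proj(control_feat)
        return fused, conf

# Toy forward pass: 8 frames, 10 tokens per frame.
T, L, d = 8, 10, 64
fuser = DenseControlFuser(d_model=d)
out, conf = fuser(torch.randn(T, L, d), torch.randn(T, L, d))
```

Such gating lets the model down-weight predicted HOI masks that are unreliable, e.g. where the first-stage densification is uncertain.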

Application: User-defined Trajectories

Sample 1

Sparse trajectory input

Input: Sparse Trajectories + Image

Trajectory Densification

Generated Sample

Sample 2

Sparse trajectory input

Input: Sparse Trajectories + Image

Trajectory Densification

Generated Sample

Citation

@article{zhang2025vhoi,
  title         = {VHOI: Controllable Video Generation of Human–Object Interactions from Sparse Trajectories via Motion Densification},
  author        = {Zhang, Wanyue and Foo, Lin Geng and Dabral, Rishabh and Beeler, Thabo and Theobalt, Christian},
  year          = {2025},
  archivePrefix = {arXiv},
  eprint        = {2512.09646},
}