Trace Anything
Representing Any Video in 4D via Trajectory Fields

Xinhang Liu1,2     Yuxi Xiao1,3     Donny Y. Chen1     Jiashi Feng1
Yu-Wing Tai4     Chi-Keung Tang2     Bingyi Kang1
1Bytedance Seed      2HKUST      3Zhejiang University      4Dartmouth College

TL;DR: We propose a 4D video representation, trajectory field, which maps each pixel across frames to a continuous, parametric 3D trajectory. With a single forward pass, the Trace Anything model efficiently estimates such trajectory fields for any video, image pair, or unstructured image set.

Abstract

Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.

Trajectory Field

Pixels, the atomic units of video, naturally trace out 3D trajectories in the world, and these trajectories are the fundamental primitives of dynamics. Recognizing this, we propose Trajectory Fields, a versatile 4D representation for any video that associates each pixel in each frame with a parametric 3D trajectory. Unlike prior 4D reconstruction methods that produce disjoint per-frame point clouds and rely on estimated optical flow or 2D tracks to build cross-frame correspondences, Trajectory Fields offer a more direct and compact way to model scene dynamics. Ideally, two conditions should hold:

(C1) Pixels in static regions collapse to degenerate trajectories, i.e., trajectories that reduce to a single fixed 3D point.

(C2) Corresponding pixels from different frames map to the same 3D trajectory.
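
To make these conditions concrete, below is a minimal sketch of how a single pixel's parametric trajectory can be evaluated from its control points. The cubic degree, the clamped-uniform knot vector, the choice of six control points, and the use of SciPy are illustrative assumptions rather than the paper's exact parameterization.

import numpy as np
from scipy.interpolate import BSpline

def make_trajectory(control_points, degree=3):
    """Clamped-uniform B-spline trajectory over normalized time t in [0, 1].
    control_points: (K, 3) array of 3D control points for one pixel."""
    k, n = degree, len(control_points)
    interior = np.linspace(0.0, 1.0, n - k + 1)[1:-1]
    knots = np.concatenate(([0.0] * (k + 1), interior, [1.0] * (k + 1)))
    return BSpline(knots, np.asarray(control_points, dtype=float), k)

# Static pixel (C1): coincident control points give a degenerate trajectory,
# i.e. the same 3D point at every query time.
static_traj = make_trajectory(np.tile([0.5, 1.0, 2.0], (6, 1)))

# Moving pixel: control points spread along a path in world space.
moving_traj = make_trajectory(np.linspace([0.0, 0.0, 2.0], [1.0, 0.5, 3.0], 6))

times = np.linspace(0.0, 1.0, 5)   # arbitrary query time instants
print(static_traj(times))          # (5, 3): identical positions
print(moving_traj(times))          # (5, 3): positions along the path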

Trace Anything Model

Building on this representation, we propose Trace Anything, a feed-forward neural network that estimates trajectory fields directly from video frames. With a single forward pass over all input frames, it predicts a stack of control point maps for each frame, defining spline-based parametric trajectories for every pixel. This one-pass scheme eliminates intermediate estimators and iterative global alignment, predicting all trajectories (per pixel per frame) jointly in a shared world coordinate system. It generalizes across diverse input settings, including monocular videos, image pairs, and unordered photo sets.
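
As a hedged sketch of what querying such an output could look like, the snippet below evaluates a stack of per-pixel control-point maps at arbitrary query times. The tensor shapes (K control points per pixel, one (K, H, W, 3) map stack per frame) and the clamped cubic B-spline basis are assumptions for illustration, not the released model interface.

import numpy as np
from scipy.interpolate import BSpline

K, H, W = 6, 64, 64                       # control points per pixel, map size
ctrl_maps = np.random.rand(K, H, W, 3)    # stand-in for one frame's prediction

degree = 3
interior = np.linspace(0.0, 1.0, K - degree + 1)[1:-1]
knots = np.concatenate(([0.0] * (degree + 1), interior, [1.0] * (degree + 1)))

times = np.linspace(0.0, 1.0, 30)                  # arbitrary query instants
basis = BSpline(knots, np.eye(K), degree)(times)   # (30, K) basis weights

# World-space 3D position of every pixel of this frame at every query time.
positions = np.einsum("tk,khwc->thwc", basis, ctrl_maps)   # (30, H, W, 3)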

Qualitative Results on Trajectory Field Estimation

Our approach naturally accommodates diverse input configurations and estimates the trajectory field in a single forward pass. The inference time is approximately 2.3 seconds for a 30-frame video and 0.2 seconds for an image pair on an NVIDIA A100 GPU.


Our predictions faithfully capture motions ranging from near-rigid transformations, such as a toy train moving along a track, to highly non-rigid deformations, such as humans or animals in motion. They also handle severe occlusions while preserving the global scene structure.

Our method effectively disentangles static and dynamic components. After Trace Anything predicts control points, we compute the variance over the control-point set associated with each pixel; thresholding this per-pixel variance yields a dynamic mask that cleanly separates static from dynamic regions.
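
A minimal sketch of this masking step is given below; the (K, H, W, 3) control-point layout, summing the variance over the three spatial axes, and the threshold value are illustrative assumptions.

import numpy as np

def dynamic_mask(ctrl_maps, threshold=1e-3):
    """ctrl_maps: (K, H, W, 3) predicted control points for one frame.
    Returns a boolean (H, W) mask where True marks dynamic pixels."""
    # Static pixels have (near-)coincident control points, so the variance
    # of their control-point set collapses towards zero.
    per_pixel_var = ctrl_maps.var(axis=0).sum(axis=-1)   # (H, W)
    return per_pixel_var > threshold

ctrl_maps = np.random.rand(6, 64, 64, 3)   # stand-in prediction
mask = dynamic_mask(ctrl_maps)
print(mask.shape, mask.mean())             # mask size, fraction flagged dynamic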

Emergent Capabilities

The Trajectory Field representation and the Trace Anything model exhibit emergent capabilities that competing approaches do not support.


When coupled with a text-conditioned video generator, our model forecasts trajectory fields from a single initial image under natural-language instructions. This enables what-if simulations of scene dynamics and quantitative analysis of how motions evolve over time.

Figure: initial image used for motion forecasting.

BibTeX


@misc{liu2025traceanythingrepresentingvideo,
      title={Trace Anything: Representing Any Video in 4D via Trajectory Fields}, 
      author={Xinhang Liu and Yuxi Xiao and Donny Y. Chen and Jiashi Feng and Yu-Wing Tai and Chi-Keung Tang and Bingyi Kang},
      year={2025},
      eprint={2510.13802},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13802}, 
}