Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.
Pixels, the atomic units of video, naturally trace out 3D trajectories in the world, which serve as the fundamental primitive of dynamics. Recognizing this, we propose Trajectory Fields, a versatile 4D representation for any video that associates each pixel in each frame with a parametric 3D trajectory. Unlike prior 4D reconstruction methods that produce disjoint per-frame point clouds and rely on estimated optical flow or 2D tracks to build cross-frame correspondences, Trajectory Fields offer a more direct and compact way to model scene dynamics. Ideally, two conditions should hold (both are sketched as simple checks below):
(C1) Pixels in static regions collapse to degenerate trajectories.
(C2) Corresponding pixels from different frames map to the same 3D trajectory.
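To make the two conditions concrete, they can be phrased as numerical checks on sampled trajectories. The sketch below is illustrative rather than part of our method: the function names and tolerance are hypothetical, and trajectories are assumed to be given as arrays of 3D samples taken at shared query times.

```python
import numpy as np

def satisfies_c1(traj, tol=1e-2):
    """C1: a pixel in a static region should yield a degenerate (near-constant) trajectory.

    traj: (T, 3) 3D positions of one pixel's trajectory sampled at T query times.
    tol:  illustrative tolerance in world units (hypothetical value).
    """
    # Peak-to-peak extent of the trajectory along each axis; a tiny extent means "static".
    return np.ptp(traj, axis=0).max() < tol

def satisfies_c2(traj_a, traj_b, tol=1e-2):
    """C2: corresponding pixels from different frames should map to the same 3D trajectory.

    traj_a, traj_b: (T, 3) samples of the two pixels' trajectories at the same query times.
    """
    # Maximum pointwise distance between the two sampled trajectories.
    return np.linalg.norm(traj_a - traj_b, axis=-1).max() < tol
```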
Building on this representation, we propose Trace Anything, a feed-forward neural network that estimates trajectory fields directly from video frames. With a single forward pass over all input frames, it predicts a stack of control point maps for each frame, defining spline-based parametric trajectories for every pixel. This one-pass scheme eliminates intermediate estimators and iterative global alignment, predicting all trajectories (per pixel per frame) jointly in a shared world coordinate system. It generalizes across diverse input settings, including monocular videos, image pairs, and unordered photo sets.
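For concreteness, here is a minimal sketch of how such per-pixel trajectories could be evaluated from one frame's control-point maps, assuming a clamped uniform cubic B-spline with K control points per pixel and query times normalized to [0, 1]; the exact spline convention and tensor layout used by the released model may differ.

```python
import numpy as np
from scipy.interpolate import BSpline

def eval_trajectory_field(ctrl_maps, t_query, degree=3):
    """Evaluate spline trajectories for every pixel of one frame.

    ctrl_maps: (K, H, W, 3) stack of K control-point maps predicted for the frame.
    t_query:   (T,) query times, assumed normalized to [0, 1].
    returns:   (T, H, W, 3) 3D positions of every pixel at the query times.
    """
    K, H, W, _ = ctrl_maps.shape
    assert K > degree, "a degree-d spline needs at least d + 1 control points"
    # Clamped uniform knot vector, so the curve starts and ends at the first/last control point.
    inner = np.linspace(0.0, 1.0, K - degree + 1)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), inner, np.ones(degree + 1)])
    # scipy broadcasts over trailing coefficient dimensions, so one call handles all pixels.
    spline = BSpline(knots, ctrl_maps.reshape(K, -1), degree)
    return spline(np.asarray(t_query)).reshape(-1, H, W, 3)
```

Evaluating at the input frames' own timestamps recovers per-frame geometry, while intermediate query times interpolate the motion in between.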
Our approach naturally accommodates diverse input configurations and estimates the trajectory field in a single forward pass. The inference time is approximately 2.3 seconds for a 30-frame video and 0.2 seconds for an image pair on an NVIDIA A100 GPU.
We show qualitative results for each input setting below:
Our predictions faithfully capture motions ranging from near-rigid transformations, such as a toy train moving along a track, to highly non-rigid deformations, such as humans or animals in motion. They also handle severe occlusions while preserving the global scene structure.
Our method effectively disentangles static and dynamic components. After Trace Anything predicts control points, we compute the variance over the control-point set associated with each pixel; thresholding this per-pixel variance yields a dynamic mask that cleanly separates static from dynamic regions.
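A minimal version of this masking step, assuming the (K, H, W, 3) control-point layout used in the sketch above; the threshold value is illustrative and depends on scene scale.

```python
import numpy as np

def dynamic_mask(ctrl_maps, thresh=1e-3):
    """Threshold per-pixel control-point variance to separate static from dynamic regions.

    ctrl_maps: (K, H, W, 3) control-point maps for one frame.
    thresh:    variance threshold; the value here is a placeholder, not a tuned constant.
    returns:   (H, W) boolean mask, True where the pixel is dynamic.
    """
    # Variance of each pixel's K control points, summed over the x/y/z coordinates.
    var = ctrl_maps.var(axis=0).sum(axis=-1)   # (H, W)
    return var > thresh
```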
By inferring trajectory fields from image pairs, our model effectively reconstructs the implied spatio-temporal dynamics and interpolates intermediate motion. In the context of robot learning, this naturally aligns with goal-conditioned manipulation, where the predicted trajectories can be interpreted as feasible robot end-effector motions.
Our method also handles unstructured, unordered image sets—a setting not addressed by prior work. These inputs lack both temporal ordering and continuous camera motion, yet our framework is inherently designed to cope with such challenging cases. For clarity, we display the input images in chronological order here, although no sequence information is provided to the model.
The Trajectory Field representation and the Trace Anything model exhibit emergent capabilities that competing approaches do not support.
We highlight three such capabilities below:
From a single initial image—and when coupled with a text-conditioned video generator—our model forecasts trajectory fields under natural-language instructions. This enables what-if simulations of scene dynamics and quantitative analysis of how motions evolve over time.
From an image pair representing the initial and goal states, Trace Anything estimates per-pixel 3D trajectories. Reprojecting these trajectories with the estimated camera poses yields dense 2D motion fields. In the context of robot learning, these 2D/3D trajectories can be interpreted as feasible end-effector motions for goal-conditioned manipulation, providing a continuous, geometry-aware bridge from visual goals to actionable control signals.
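A minimal sketch of this reprojection under a standard pinhole camera model; the variable names and the world-to-camera convention are assumptions rather than the paper's exact interface. Projecting a trajectory's samples at two query times into the same view gives one vector of the dense 2D motion field.

```python
import numpy as np

def project_points(points_w, K, T_wc):
    """Project world-frame 3D trajectory samples into an image (pinhole model).

    points_w: (N, 3) 3D positions in the shared world coordinate system.
    K:        (3, 3) camera intrinsics of the target view.
    T_wc:     (4, 4) world-to-camera extrinsics of the target view.
    returns:  (N, 2) pixel coordinates.
    """
    ones = np.ones((points_w.shape[0], 1))
    p_cam = (T_wc @ np.concatenate([points_w, ones], axis=1).T).T[:, :3]  # world -> camera
    uv = (K @ p_cam.T).T
    return uv[:, :2] / uv[:, 2:3]                                         # perspective divide

def motion_field_2d(traj_t0, traj_t1, K, T_wc):
    """2D motion vectors between two query times, both projected into the same view."""
    return project_points(traj_t1, K, T_wc) - project_points(traj_t0, K, T_wc)
```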
In this video, both the camera and the scene itself are in motion. By leveraging the trajectory field predicted by Trace Anything, we can canonically align the dynamics into a single reference frame. This not only enables faithful reconstruction of underlying structures but also disentangles scene motion from viewpoint changes.
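One way to realize this fusion, assuming the control-point layout and spline evaluator sketched earlier: evaluate every frame's trajectory field at a single canonical time and aggregate the resulting world-frame points. The helper names are ours, and the choice of canonical time is arbitrary.

```python
import numpy as np

def fuse_to_canonical_time(ctrl_maps_per_frame, eval_fn, t_canonical=0.0):
    """Fuse per-frame trajectory fields into one point cloud at a canonical time.

    ctrl_maps_per_frame: list of (K, H, W, 3) control-point stacks, one per input frame,
                         all parameterizing trajectories in the shared world frame.
    eval_fn:             trajectory evaluator, e.g. eval_trajectory_field from the sketch
                         above, mapping (ctrl_maps, times) -> (T, H, W, 3) positions.
    returns:             (F*H*W, 3) fused world-frame point cloud at t_canonical.
    """
    points = [
        eval_fn(ctrl_maps, np.array([t_canonical]))[0].reshape(-1, 3)
        for ctrl_maps in ctrl_maps_per_frame
    ]
    return np.concatenate(points, axis=0)
```

Because all trajectories are predicted in a shared world coordinate system, no per-frame alignment step is needed, and dynamic content is gathered at its canonical-time position.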
@misc{liu2025traceanythingrepresentingvideo,
      title={Trace Anything: Representing Any Video in 4D via Trajectory Fields},
      author={Xinhang Liu and Yuxi Xiao and Donny Y. Chen and Jiashi Feng and Yu-Wing Tai and Chi-Keung Tang and Bingyi Kang},
      year={2025},
      eprint={2510.13802},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13802},
}