Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.
Pixels, the atomic units of video, naturally trace out 3D trajectories in the world, which serve as the fundamental primitive of dynamics. Recognizing this, we propose Trajectory Fields, a versatile 4D representation for any video that associates each pixel in each frame with a parametric 3D trajectory. Unlike prior 4D reconstruction methods that produce disjoint per-frame point clouds and rely on estimated optical flow or 2D tracks to build cross-frame correspondences, Trajectory Fields offer a more direct and compact way to model scene dynamics. Ideally, two conditions should hold (both are sketched as simple checks below):
(C1) Pixels in static regions collapse to degenerate trajectories.
(C2) Corresponding pixels from different frames map to the same 3D trajectory.
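To make the two conditions concrete, they can be phrased as numerical checks on sampled trajectories. The sketch below is illustrative rather than part of our method: the function names and tolerance are hypothetical, and trajectories are assumed to be given as arrays of 3D samples taken at shared query times.

```python
import numpy as np

def satisfies_c1(traj, tol=1e-2):
    """C1: a pixel in a static region should yield a degenerate (near-constant) trajectory.

    traj: (T, 3) 3D positions of one pixel's trajectory sampled at T query times.
    tol:  illustrative tolerance in world units (hypothetical value).
    """
    # Peak-to-peak extent of the trajectory along each axis; a tiny extent means "static".
    return np.ptp(traj, axis=0).max() < tol

def satisfies_c2(traj_a, traj_b, tol=1e-2):
    """C2: corresponding pixels from different frames should map to the same 3D trajectory.

    traj_a, traj_b: (T, 3) samples of the two pixels' trajectories at the same query times.
    """
    # Maximum pointwise distance between the two sampled trajectories.
    return np.linalg.norm(traj_a - traj_b, axis=-1).max() < tol
```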
Building on this representation, we propose Trace Anything, a feed-forward neural network that estimates trajectory fields directly from video frames. With a single forward pass over all input frames, it predicts a stack of control point maps for each frame, defining spline-based parametric trajectories for every pixel. This one-pass scheme eliminates intermediate estimators and iterative global alignment, predicting all trajectories (per pixel per frame) jointly in a shared world coordinate system. It generalizes across diverse input settings, including monocular videos, image pairs, and unordered photo sets.
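For concreteness, here is a minimal sketch of how such per-pixel trajectories could be evaluated from one frame's control-point maps, assuming a clamped uniform cubic B-spline with K control points per pixel and query times normalized to [0, 1]; the exact spline convention and tensor layout used by the released model may differ.

```python
import numpy as np
from scipy.interpolate import BSpline

def eval_trajectory_field(ctrl_maps, t_query, degree=3):
    """Evaluate spline trajectories for every pixel of one frame.

    ctrl_maps: (K, H, W, 3) stack of K control-point maps predicted for the frame.
    t_query:   (T,) query times, assumed normalized to [0, 1].
    returns:   (T, H, W, 3) 3D positions of every pixel at the query times.
    """
    K, H, W, _ = ctrl_maps.shape
    assert K > degree, "a degree-d spline needs at least d + 1 control points"
    # Clamped uniform knot vector, so the curve starts and ends at the first/last control point.
    inner = np.linspace(0.0, 1.0, K - degree + 1)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), inner, np.ones(degree + 1)])
    # scipy broadcasts over trailing coefficient dimensions, so one call handles all pixels.
    spline = BSpline(knots, ctrl_maps.reshape(K, -1), degree)
    return spline(np.asarray(t_query)).reshape(-1, H, W, 3)
```

Evaluating at the input frames' own timestamps recovers per-frame geometry, while intermediate query times interpolate the motion in between.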
Our approach naturally accommodates diverse input configurations and estimates the trajectory field in a single forward pass. The inference time is approximately 2.3 seconds for a 30-frame video and 0.2 seconds for an image pair on an NVIDIA A100 GPU.
We show qualitative results for each input setting below:
Our predictions faithfully capture motions ranging from near-rigid transformations, such as a toy train moving along a track, to highly non-rigid deformations, such as humans or animals in motion. They also handle severe occlusions while preserving the global scene structure.
Our method effectively disentangles static and dynamic components. After Trace Anything predicts control points, we compute the variance over the control-point set associated with each pixel; thresholding this per-pixel variance yields a dynamic mask that cleanly separates static from dynamic regions.
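A minimal version of this masking step, assuming the (K, H, W, 3) control-point layout used in the sketch above; the threshold value is illustrative and depends on scene scale.

```python
import numpy as np

def dynamic_mask(ctrl_maps, thresh=1e-3):
    """Threshold per-pixel control-point variance to separate static from dynamic regions.

    ctrl_maps: (K, H, W, 3) control-point maps for one frame.
    thresh:    variance threshold; the value here is a placeholder, not a tuned constant.
    returns:   (H, W) boolean mask, True where the pixel is dynamic.
    """
    # Variance of each pixel's K control points, summed over the x/y/z coordinates.
    var = ctrl_maps.var(axis=0).sum(axis=-1)   # (H, W)
    return var > thresh
```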
By inferring trajectory fields from image pairs, our model effectively reconstructs the implied spatio-temporal dynamics and interpolates intermediate motion. In the context of robot learning, this naturally aligns with goal-conditioned manipulation, where the predicted trajectories can be interpreted as feasible robot end-effector motions.
Our method also handles unstructured, unordered image sets—a setting not addressed by prior work. These inputs lack both temporal ordering and continuous camera motion, yet our framework is inherently designed to cope with such challenging cases. For clarity, we display the input images in chronological order here, although no sequence information is provided to the model.
The Trajectory Field representation and the Trace Anything model exhibit emergent capabilities that competing approaches do not support.
We highlight three such capabilities below:
From a single initial image—and when coupled with a text-conditioned video generator—our model forecasts trajectory fields under natural-language instructions. This enables what-if simulations of scene dynamics and quantitative analysis of how motions evolve over time.
From an image pair representing the initial and goal states, Trace Anything estimates per-pixel 3D trajectories. Reprojecting these trajectories with the estimated camera poses yields dense 2D motion fields. In the context of robot learning, these 2D/3D trajectories can be interpreted as feasible end-effector motions for goal-conditioned manipulation, providing a continuous, geometry-aware bridge from visual goals to actionable control signals.
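A minimal sketch of this reprojection under a standard pinhole camera model; the variable names and the world-to-camera convention are assumptions rather than the paper's exact interface. Projecting a trajectory's samples at two query times into the same view gives one vector of the dense 2D motion field.

```python
import numpy as np

def project_points(points_w, K, T_wc):
    """Project world-frame 3D trajectory samples into an image (pinhole model).

    points_w: (N, 3) 3D positions in the shared world coordinate system.
    K:        (3, 3) camera intrinsics of the target view.
    T_wc:     (4, 4) world-to-camera extrinsics of the target view.
    returns:  (N, 2) pixel coordinates.
    """
    ones = np.ones((points_w.shape[0], 1))
    p_cam = (T_wc @ np.concatenate([points_w, ones], axis=1).T).T[:, :3]  # world -> camera
    uv = (K @ p_cam.T).T
    return uv[:, :2] / uv[:, 2:3]                                         # perspective divide

def motion_field_2d(traj_t0, traj_t1, K, T_wc):
    """2D motion vectors between two query times, both projected into the same view."""
    return project_points(traj_t1, K, T_wc) - project_points(traj_t0, K, T_wc)
```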
In this video, both the camera and the scene itself are in motion. By leveraging the trajectory field predicted by Trace Anything, we can canonically align the dynamics into a single reference frame. This not only enables faithful reconstruction of underlying structures but also disentangles scene motion from viewpoint changes.
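One way to realize this fusion, assuming the control-point layout and spline evaluator sketched earlier: evaluate every frame's trajectory field at a single canonical time and aggregate the resulting world-frame points. The helper names are ours, and the choice of canonical time is arbitrary.

```python
import numpy as np

def fuse_to_canonical_time(ctrl_maps_per_frame, eval_fn, t_canonical=0.0):
    """Fuse per-frame trajectory fields into one point cloud at a canonical time.

    ctrl_maps_per_frame: list of (K, H, W, 3) control-point stacks, one per input frame,
                         all parameterizing trajectories in the shared world frame.
    eval_fn:             trajectory evaluator, e.g. eval_trajectory_field from the sketch
                         above, mapping (ctrl_maps, times) -> (T, H, W, 3) positions.
    returns:             (F*H*W, 3) fused world-frame point cloud at t_canonical.
    """
    points = [
        eval_fn(ctrl_maps, np.array([t_canonical]))[0].reshape(-1, 3)
        for ctrl_maps in ctrl_maps_per_frame
    ]
    return np.concatenate(points, axis=0)
```

Because all trajectories are predicted in a shared world coordinate system, no per-frame alignment step is needed, and dynamic content is gathered at its canonical-time position.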
@misc{liu2025traceanythingrepresentingvideo,
      title={Trace Anything: Representing Any Video in 4D via Trajectory Fields},
      author={Xinhang Liu and Yuxi Xiao and Donny Y. Chen and Jiashi Feng and Yu-Wing Tai and Chi-Keung Tang and Bingyi Kang},
      year={2025},
      eprint={2510.13802},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13802},
}