Learning Task-Centric World Models from Visual Foundations

Turning frozen foundation embeddings into compact, task-centric latent dynamics for downstream reward-free offline planning and control.

Minghao Fu¹, Fan Feng¹, Nicklas Hansen¹, Biwei Huang¹

¹University of California, San Diego

Abstract

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction.

To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent space in which dynamics are modeled, aligns a subspace of that latent with the agent's physical state via contrastive learning, and reconstructs the embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

Figure 1: Comparison of world-model paradigms. (a) Generative WM (MDN-RNN, IRIS): predict pixels directly. (b) Latent WM (TD-MPC, MuZero): predict in a learned latent. (c) Embedding WM (DINO-WM, V-JEPA): predict in a frozen foundation embedding. (d) TC-WM dynamics: predict in a compact latent inside the embedding. (e) Task-centric structure: align z_s with proprioception, anchor z_c via embedding reconstruction.

Figure 2: Robomimic highlight. (1) Success rate. (2) Latent-rollout MSE. (3–4) Linear probes on Lift and Can. (5) TC-WM avoids the latent collapse seen when rolling out directly on foundation embeddings.
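
A linear probe of the kind reported in panels (3–4) can be sketched as a least-squares fit from latents to a target quantity, scored on held-out samples. This is a minimal illustration, not the authors' evaluation code: the probed target (assumed here to be a low-dimensional physical state), the train/test split, the R² score, and the synthetic data are all assumptions.

```python
import numpy as np

def linear_probe_r2(latents, targets, train_frac=0.8):
    """Fit a least-squares linear map latents -> targets; return held-out R^2."""
    n_train = int(train_frac * len(latents))
    X = np.hstack([latents, np.ones((len(latents), 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(X[:n_train], targets[:n_train], rcond=None)
    pred = X[n_train:] @ W
    held_out = targets[n_train:]
    resid = np.sum((held_out - pred) ** 2)
    total = np.sum((held_out - held_out.mean(axis=0)) ** 2)
    return 1.0 - resid / total

# Synthetic stand-in data: the "state" is a noisy linear function of 3 latent dims.
rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 16))
state = latents[:, :3] @ rng.normal(size=(3, 3)) + 0.01 * rng.normal(size=(500, 3))
r2 = linear_probe_r2(latents, state)
```

A high held-out score indicates that the target is linearly decodable from the latent, which is the property a linear probe is meant to test.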

Environments

Nine offline visual-control tasks spanning navigation, locomotion, and manipulation. Models are trained on images, actions, and proprioception, with no reward signal.

Maze
Wall
Push-T
Lift (Robomimic)
Can (Robomimic)
Square (Robomimic)
Reacher (DMC)
Cheetah (DMC)
Hopper (DMC)

Method

TC-WM architecture: linear projection of joint visual and proprioception embeddings into a compact latent, with task-centric alignment, dynamics, and embedding reconstruction

Architecture. Embeddings from a frozen visual backbone are linearly projected into a compact latent; a designated subspace is aligned with proprioception via InfoNCE; a ViT predicts latent dynamics; and a linear decoder reconstructs the embedding to prevent collapse. Under partial alignment, the task-centric block is identifiable up to an affine map.
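
The alignment and reconstruction terms named above can be sketched as a single forward pass. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the dimensions, the temperature, the linear proprioception encoder `W_prop`, and the function names are hypothetical, and the ViT dynamics model is omitted.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss: row i of `queries` should match row i of `keys`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                  # (B, B) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives sit on the diagonal

def tcwm_losses(emb, prop, W_enc, W_dec, W_prop, d_s):
    """Return (alignment loss, reconstruction loss) for one batch."""
    z = emb @ W_enc                     # linear projection into the compact latent
    z_s = z[:, :d_s]                    # subspace to be aligned with proprioception
    align = info_nce(z_s, prop @ W_prop)
    recon = np.mean((z @ W_dec - emb) ** 2)   # linear decoder reconstructs the embedding
    return align, recon

# Hypothetical sizes: batch, embedding dim, latent dim, aligned dims, proprio dim.
B, D, d, d_s, P = 8, 384, 32, 8, 7
rng = np.random.default_rng(0)
emb, prop = rng.normal(size=(B, D)), rng.normal(size=(B, P))
W_enc = rng.normal(size=(D, d)) / np.sqrt(D)
W_dec = rng.normal(size=(d, D)) / np.sqrt(d)
W_prop = rng.normal(size=(P, d_s)) / np.sqrt(P)
align_loss, recon_loss = tcwm_losses(emb, prop, W_enc, W_dec, W_prop, d_s)
```

In training, these two terms would be combined with a dynamics-prediction loss and minimized over the projection, decoder, and dynamics parameters; the weighting between them is not specified here.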

Results

World model prediction accuracy across nine environments

World-model prediction. TC-WM achieves the lowest latent-prediction error on nearly all tasks, with competitive image reconstruction. Lower is better.
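
Latent-prediction error under open-loop rollout can be measured by feeding the model its own predictions and comparing each step against the ground-truth latent. The sketch below is illustrative only: `true_step` and `model_step` are toy stand-ins for the environment and a learned dynamics model, and the function name is hypothetical.

```python
import numpy as np

def open_loop_rollout_mse(z0, actions, z_true, dynamics):
    """Roll out from z0 using only actions; return per-step MSE against z_true."""
    z, errors = z0, []
    for a, target in zip(actions, z_true):
        z = dynamics(z, a)                      # feed the prediction back in
        errors.append(np.mean((z - target) ** 2))
    return np.array(errors)

def true_step(z, a):
    """Toy ground-truth latent transition."""
    return z + 0.5 * a

def model_step(z, a):
    """Toy learned model with a slightly wrong gain."""
    return z + 0.45 * a

actions = np.ones((10, 4))
z_true, z = [], np.zeros(4)
for a in actions:
    z = true_step(z, a)
    z_true.append(z)
errors = open_loop_rollout_mse(np.zeros(4), actions, z_true, model_step)
```

Because the model never sees ground-truth latents after the first step, small one-step errors accumulate over the horizon, which is why rollout error is a stricter test than one-step prediction error.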

Planning. CEM (cross-entropy method) on Maze / Wall / Push-T / Cheetah / Hopper; LDP on Lift / Can / Square. TC-WM is the only method that surpasses DINO-WM on every LDP task.
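
For readers unfamiliar with CEM, the following is a minimal sketch of cross-entropy-method planning over action sequences in a latent space. It is not the paper's planner: the goal-latent cost, the hyperparameters, and `toy_dynamics` (a stand-in for a learned latent dynamics model) are assumptions made for illustration.

```python
import numpy as np

def cem_plan(z0, z_goal, dynamics, horizon=5, action_dim=2,
             n_samples=256, n_elite=32, n_iters=15, seed=0):
    """Cross-entropy method over open-loop action sequences."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current diagonal Gaussian.
        actions = mean + std * rng.normal(size=(n_samples, horizon, action_dim))
        # Roll every candidate forward through the latent dynamics.
        z = np.repeat(z0[None], n_samples, axis=0)
        for t in range(horizon):
            z = dynamics(z, actions[:, t])
        cost = np.sum((z - z_goal) ** 2, axis=1)    # squared distance to the goal latent
        elite = actions[np.argsort(cost)[:n_elite]]
        # Refit the sampling distribution to the lowest-cost candidates.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def toy_dynamics(z, a):
    """Stand-in for a learned latent model: the action shifts the latent."""
    return z + 0.5 * a

z0, z_goal = np.zeros(2), np.array([1.0, -1.0])
plan = cem_plan(z0, z_goal, toy_dynamics)
z = z0
for a in plan:
    z = toy_dynamics(z, a)
final_error = np.linalg.norm(z - z_goal)
```

No reward function appears anywhere in the loop: the cost is defined purely by distance to a goal in latent space, which is one common way to plan in a reward-free setting.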

clean input  ·  TC-WM pred (clean)  ·  noisy input  ·  TC-WM pred (noisy)

Lift  ·  Gaussian noise + color jitter

Unseen perturbations. Under additive Gaussian noise and per-channel color jitter never seen during training, TC-WM's open-loop rollout preserves cube position and manipulator silhouette at near-clean fidelity.
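
A perturbation of this kind can be sketched as follows. The noise scale, the jitter range, and the choice to model color jitter as a random multiplicative gain per channel are assumptions for illustration; the exact perturbation parameters used in the experiment are not stated here.

```python
import numpy as np

def perturb(image, noise_std=0.05, jitter=0.2, seed=0):
    """Apply a random per-channel gain, then additive Gaussian noise.

    `image` is an (H, W, 3) float array with values in [0, 1].
    """
    rng = np.random.default_rng(seed)
    gains = 1.0 + rng.uniform(-jitter, jitter, size=3)    # one gain per color channel
    noisy = image * gains + rng.normal(scale=noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)                       # keep a valid pixel range

clean = np.full((64, 64, 3), 0.5)    # flat gray test image
noisy = perturb(clean)
```

The perturbed frame would then be passed through the same frozen encoder and rollout as the clean frame, so any difference in the prediction is attributable to the input corruption alone.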

Open-Loop Rollout

Each task shows three panels: original, reconstruction, and TC-WM rollout.

Lift  ·  Can  ·  Square  ·  Wall  ·  Maze  ·  Cheetah

Analysis

Architecture / projection ablations on Robomimic. (a) Visual encoder / latent-source choice and (b) projection-head ablation, each showing success rate on the left and mirrored SSIM on the right. RP denotes randomly projected DINOv2 embeddings. Linear projection performs best among the projection heads; DINOv2 and DINOv3 are the strongest foundation encoders.

Loss components on Lift. Removing embedding reconstruction collapses both planning and visual fidelity; removing proprioceptive supervision lowers success rate while preserving SSIM; the latent split dimension matters little.

BibTeX

@article{fu2026tcwm,
  title   = {Learning Task-Centric World Models from Visual Foundations},
  author  = {Fu, Minghao and Feng, Fan and Hansen, Nicklas and Huang, Biwei},
  journal = {arXiv preprint arXiv:2605.25620},
  year    = {2026}
}