RobvanGastel/adapt-vjepa-world-model

Exploring the capabilities of the V-JEPA2 model

What does the latent space of V-JEPA2 look like compared to that of image encoders such as DINOv2 and DINOv3, which I explored in earlier repositories? These SSL pre-trained models may also offer a way to train world models without massive amounts of compute.

Investigate the latent space of the V-JEPA2 model by:

  • PCA without masking.
  • PCA with masking: what do we recover?

Check out the Exploration.ipynb notebook for a more detailed walkthrough of the code and ideas behind it.
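The PCA visualization used in the notebook can be sketched as follows. This is a minimal stand-in, assuming the encoder's patch features are already available as a `(num_patches, dim)` array; `features` below is random data in place of real V-JEPA2 outputs, and the projection of the top three principal components to RGB is the usual trick for visualizing dense feature maps.

```python
import numpy as np

def pca_rgb(features: np.ndarray, grid: tuple) -> np.ndarray:
    """Project patch features onto the top 3 principal components
    and rescale to [0, 1] so they can be shown as an RGB image."""
    x = features - features.mean(axis=0)        # center features
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    comps = x @ vt[:3].T                        # (num_patches, 3)
    comps -= comps.min(axis=0)
    comps /= comps.max(axis=0) + 1e-8           # per-channel rescale
    return comps.reshape(*grid, 3)

# Stand-in for real patch features: a 16x16 grid of 1024-d vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(16 * 16, 1024))
image = pca_rgb(features, grid=(16, 16))
print(image.shape)  # (16, 16, 3)
```

For the masked variant, the same function can be fit on the unmasked patches only, which changes which directions of variation dominate the visualization.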

  • Compare the transition model's latent-space predictions for V-JEPA2 with those for DINO (other encoders are skipped for now).
    • Add a decoder for visualization purposes.
  • Generate a better dataset, with an option for a simple RGB-frame environment.
    • Balance a pendulum, and secondly include the actions.
  • Add an option for action selection with MPC and CEM.
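The transition model in this repo is a ViT latent predictor; as an illustration of the interface it learns, here is the simplest possible stand-in, a linear model fit by least squares to map a (latent, action) pair to the next latent. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "latent" dynamics z' = A z + B a, standing in for the ViT
# predictor trained in train_world_model.py.
dim_z, dim_a, n = 8, 1, 500
A = rng.normal(scale=0.3, size=(dim_z, dim_z))
B = rng.normal(size=(dim_z, dim_a))
z = rng.normal(size=(n, dim_z))          # current latents
a = rng.normal(size=(n, dim_a))          # actions
z_next = z @ A.T + a @ B.T               # next latents

# Fit W in z' ~ [z, a] W by least squares.
X = np.concatenate([z, a], axis=1)
W, *_ = np.linalg.lstsq(X, z_next, rcond=None)

err = np.abs(X @ W - z_next).max()
print(err)  # ~0 for this linear system
```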

Check out the World Model.ipynb notebook to test the planning of the model. Note that the continuous action space is considerably harder than the discrete action space used in the original paper.

(WIP) The current MPC/CEM loop is the best-performing one for the pendulum so far, but it does not always succeed.
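The MPC/CEM loop can be sketched on a toy system as follows. This is not the repo's implementation: the dynamics and cost below are a 1-D double integrator standing in for the pendulum's latent dynamics, and all names are placeholders. The structure is the standard cross-entropy method: sample action sequences, roll them out, refit a Gaussian to the lowest-cost elites.

```python
import numpy as np

def cem_plan(dynamics, cost, state, horizon=10, pop=64, elites=8, iters=5):
    """Cross-entropy method: sample action sequences, roll them out
    through the dynamics, refit a Gaussian to the elite sequences."""
    rng = np.random.default_rng(0)
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=(pop, horizon))
        costs = np.empty(pop)
        for i, seq in enumerate(actions):
            s, c = state, 0.0
            for u in seq:
                s = dynamics(s, u)
                c += cost(s)
            costs[i] = c
        elite = actions[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu  # MPC would execute mu[0] and replan each step

# Toy double integrator instead of pendulum latents: drive position to 0.
dynamics = lambda s, u: (s[0] + 0.1 * s[1], s[1] + 0.1 * u)
cost = lambda s: s[0] ** 2 + 0.1 * s[1] ** 2
plan = cem_plan(dynamics, cost, state=(1.0, 0.0))
```

In a receding-horizon (MPC) setting only the first action of the plan is executed before replanning, which is what makes the loop robust to prediction errors in the learned latent dynamics.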

Setup

Install the packages using the requirements.txt file.

# using conda
conda create --name jepa python=3.11
conda activate jepa
pip install -r requirements.txt

# Run the training code, adjust the argparse flags
python train_world_model.py
# Test the planning of the model on the pendulum environment
python test_planning.py

Model Weights

The model is split into three components: the action embedding network, the ViT latent predictor for future-state prediction, and the decoder for visualizing the latents.

Finally, place the weights in the output folder so the networks can easily be loaded back in.
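A minimal sketch of that layout, assuming one file per component. The filenames are hypothetical, and `pickle` is used here only to keep the sketch dependency-free; the repo's checkpoints would normally be saved and restored with the framework's own serialization.

```python
import pickle
from pathlib import Path

OUT = Path("output")
OUT.mkdir(exist_ok=True)

# Hypothetical filenames for the three components described above,
# with toy stand-ins for the real state dicts.
components = {
    "action_embedding.pkl": {"weight": [[0.1, 0.2]]},
    "latent_predictor.pkl": {"weight": [[0.3]]},
    "decoder.pkl": {"weight": [[0.4]]},
}

for name, state in components.items():
    with open(OUT / name, "wb") as f:
        pickle.dump(state, f)

# Loading them back is the mirror image:
loaded = {}
for name in components:
    with open(OUT / name, "rb") as f:
        loaded[name] = pickle.load(f)
```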

PCA visualizations

The V-JEPA2 model takes in frames two at a time and merges each pair into a single unit in the output space, since its tubelet size is 2. A number of frames from a kitesurfing video are shown below.
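The effect of tubelet size 2 on the time axis can be illustrated with a plain numpy reshape; the clip length and resolution below are placeholders, not the model's actual input spec.

```python
import numpy as np

# 8 RGB frames of a hypothetical 224x224 video clip: (T, C, H, W).
frames = np.zeros((8, 3, 224, 224))

# Tubelet size 2: each pair of consecutive frames forms one temporal
# unit, so 8 frames yield 4 tubelets along the time axis.
tubelet = 2
tubelets = frames.reshape(8 // tubelet, tubelet, 3, 224, 224)
print(tubelets.shape)  # (4, 2, 3, 224, 224)
```

This is why the latent features below are produced per frame pair rather than per individual frame.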

When passing, for example, frames 5 and 6 through the encoder, we obtain the following latent features after processing them with PCA for visualization purposes. The encoder clearly separates the kites in both frames.

World Model

These are the outputs of training the future latent state predictor and decoder on the Pendulum environment. As you can see, it predicts the first three states fairly accurately.

The latent states produced by the predictor are also comparable to those produced by the encoder.
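One simple way to quantify such a comparison is per-step cosine similarity between the predicted and encoded latent sequences. The vectors below are random stand-ins, not actual model outputs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (T, dim) latent sequences."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

rng = np.random.default_rng(0)
encoded = rng.normal(size=(5, 64))                      # encoder latents
predicted = encoded + 0.05 * rng.normal(size=(5, 64))   # predictor outputs
print(cosine_similarity(encoded, predicted))            # values close to 1.0
```

Values near 1.0 indicate the predictor's rollout stays close to what the encoder would produce from the real frames.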

A short snippet where it does succeed in balancing the pendulum.

References

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Mojtaba, Komeili, Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Hogan, F. R., Dugas, D., Bojanowski, P., Khalidov, V., … Ballas, N. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (No. arXiv:2506.09985). arXiv. https://doi.org/10.48550/arXiv.2506.09985

Kim, I. H., Cho, S., Huang, J., Yi, J., Lee, J.-Y., & Kim, S. (2025). Exploring Temporally-Aware Features for Point Tracking (No. arXiv:2501.12218). arXiv. https://doi.org/10.48550/arXiv.2501.12218

Zhou, G., Pan, H., LeCun, Y., & Pinto, L. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (No. arXiv:2411.04983). arXiv. https://doi.org/10.48550/arXiv.2411.04983

About

Can the V-JEPA2 model be used as a world model?
