What does the latent space of V-JEPA2 look like compared to that of image encoders such as DINOv2 and DINOv3, which I explored in earlier repositories? These SSL pre-trained models may also offer a way to train world models without massive amounts of compute.
Investigate the latent space of the V-JEPA2 model by:
- PCA without masking.
- PCA with masking: what do we recover?
Check out the Exploration.ipynb notebook for a more detailed walkthrough of the code and ideas behind it.
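The PCA visualizations above can be sketched roughly as follows. This is a minimal stand-in, assuming the encoder output has already been extracted as a `(num_patches, dim)` array; the random features, shapes, and the `pca_rgb` helper are illustrative, not the notebook's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(features, mask=None, n_components=3):
    """Project patch features onto 3 PCA components and rescale to [0, 1] RGB.

    features: (num_patches, dim) array of encoder outputs.
    mask: optional boolean array (num_patches,); PCA is fit on the
          unmasked patches only, and masked patches are rendered black.
    """
    pca = PCA(n_components=n_components)
    fit_on = features if mask is None else features[mask]
    pca.fit(fit_on)
    proj = pca.transform(features)
    # Min-max normalize each component so it maps onto an RGB channel.
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    if mask is not None:
        proj[~mask] = 0.0
    return proj

# Random stand-in features for a 16x16 patch grid with 1024-dim latents:
feats = np.random.randn(256, 1024)
rgb = pca_rgb(feats)  # (256, 3); reshape to (16, 16, 3) to plot
```

Fitting PCA only on the unmasked patches mimics the masked setting: the projection is then applied to every patch, so you can inspect what the components recover in the masked regions.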
- Compare the transition model's latent-space predictions for V-JEPA2 with those for DINO. Other encoders are skipped for now.
- Add a decoder for visualization purposes.
- Generate a better dataset, option for simple RGB frame environment.
- Balance a pendulum first, then include the actions.
- Add option for actions with MPC and CEM.
Check out the World Model.ipynb notebook to test the model's planning. Note that the continuous action space is considerably harder than the discrete action space used in the original paper.
(WIP) The current MPC-CEM loop is the best-performing one on the pendulum so far, but it does not always succeed.
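The MPC-CEM loop can be summarized with the sketch below. The `dynamics` and `cost` callables stand in for the learned latent predictor and a goal-distance cost; all hyperparameter values here are illustrative, not the ones used in the repo.

```python
import numpy as np

def cem_plan(dynamics, cost, z0, horizon=15, pop=64, elites=8, iters=5,
             act_low=-2.0, act_high=2.0, seed=0):
    """Cross-entropy method over action sequences.

    dynamics(z, a) -> next latent state; cost(z) -> scalar to minimize.
    Returns the first action of the best sequence (MPC-style: execute it,
    observe, then replan from the new state).
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(horizon)
    sigma = np.ones(horizon) * (act_high - act_low) / 2
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        acts = np.clip(rng.normal(mu, sigma, size=(pop, horizon)),
                       act_low, act_high)
        returns = np.zeros(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = dynamics(z, acts[i, t])
                returns[i] += cost(z)
        # Refit the sampling distribution on the lowest-cost sequences.
        elite = acts[np.argsort(returns)[:elites]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]

# Toy check on a 1D integrator: the planner should push z toward 0.
first_action = cem_plan(lambda z, a: z + 0.1 * a, lambda z: z * z, z0=1.0)
```

With a learned model, `dynamics` would roll the ViT predictor forward in latent space and `cost` would measure distance to an encoded goal frame.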
Install the packages using the requirements.txt file.

```bash
# using conda
conda create --name jepa python=3.11
conda activate jepa
pip install -r requirements.txt
```
```bash
# Run the training code, adjust the argparse flags
python train_world_model.py
# Test the planning of the model on the pendulum environment
python test_planning.py
```

The model is split into three components: the action embedding network, a ViT latent predictor for future-state prediction, and a decoder for visualizing the latents. After training, these are saved to the output folder so the networks can easily be loaded back in.
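Saving and reloading the three components might look like the sketch below. The tiny `nn.Linear` modules and the `<name>.pt` file layout are stand-ins for the real networks and whatever naming `train_world_model.py` actually uses.

```python
from pathlib import Path

import torch
from torch import nn

# Stand-in components; the real ones are the action embedding network,
# the ViT latent predictor, and the decoder.
components = {
    "action_embed": nn.Linear(1, 64),
    "predictor": nn.Linear(64, 64),
    "decoder": nn.Linear(64, 3 * 16 * 16),
}

def save_components(components, out_dir="output"):
    """Write each component's state dict to out_dir/<name>.pt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, module in components.items():
        torch.save(module.state_dict(), out / f"{name}.pt")

def load_components(components, out_dir="output"):
    """Load weights back into freshly constructed modules."""
    out = Path(out_dir)
    for name, module in components.items():
        module.load_state_dict(torch.load(out / f"{name}.pt"))
    return components
```

Saving state dicts rather than whole modules keeps the files loadable even if the surrounding class definitions move between refactors.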
The V-JEPA2 encoder takes in two frames and merges them into a single step in the output space, since its tubelet size is 2. A number of frames from a kitesurfing video are shown below.
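The tubelet arithmetic works out as follows; the 256x256 input and 16x16 patch size are assumed here (a ViT-L/16-style setup), so adjust for your checkpoint.

```python
def num_output_tokens(num_frames, img_size=256, patch=16, tubelet=2):
    """Tokens the video encoder emits: frames are grouped in pairs
    (tubelet size 2), and each pair yields one (img_size/patch)^2 grid
    of patch tokens."""
    assert num_frames % tubelet == 0, "frame count must be a multiple of the tubelet size"
    per_grid = (img_size // patch) ** 2
    return (num_frames // tubelet) * per_grid

# Two frames collapse into a single 16x16 grid of latent tokens:
num_output_tokens(2)  # -> 256
```

This is why frames are fed to the encoder in pairs: an odd trailing frame has no tubelet partner.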

When passing, for example, frames 5 and 6 through the encoder, we get the following latent features, processed with PCA for visualization purposes. The encoder clearly separates the kites in both frames.

These are the outputs of the future latent state predictor and decoder trained on the Pendulum environment. As you can see, it predicts the first three states quite accurately.

As for the latent-state comparison between the predictor and the encoder: these are also closely comparable.
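One simple way to quantify this comparison is mean per-token cosine similarity between predicted and encoded latents; the helper below is an illustrative sketch, not the repo's metric.

```python
import numpy as np

def latent_similarity(pred, target, eps=1e-8):
    """Mean per-token cosine similarity between predicted and encoded
    latents, both shaped (num_tokens, dim)."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    tgt_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float((pred_n * tgt_n).sum(-1).mean())

# Identical latents score 1.0; unrelated random latents hover near 0.
z = np.random.randn(256, 64)
latent_similarity(z, z)  # -> 1.0 (up to float precision)
```

Tracking this score over the prediction horizon shows where the rollout drifts away from the encoder's latents.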

A short clip where the model does succeed in balancing the pendulum.

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Hogan, F. R., Dugas, D., Bojanowski, P., Khalidov, V., … Ballas, N. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (No. arXiv:2506.09985). arXiv. https://doi.org/10.48550/arXiv.2506.09985
Kim, I. H., Cho, S., Huang, J., Yi, J., Lee, J.-Y., & Kim, S. (2025). Exploring Temporally-Aware Features for Point Tracking (No. arXiv:2501.12218). arXiv. https://doi.org/10.48550/arXiv.2501.12218
Zhou, G., Pan, H., LeCun, Y., & Pinto, L. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (No. arXiv:2411.04983). arXiv. https://doi.org/10.48550/arXiv.2411.04983