What does the latent space of V-JEPA2 look like compared to that of image encoders such as DINOv2 and DINOv3, which I explored in earlier repositories? These SSL pre-trained models may also offer a way to train world models without massive amounts of compute.
Investigate the latent space of the V-JEPA2 model by:
- PCA without masking.
- PCA with masking: what do we recover?
Check out the Exploration.ipynb notebook for a more detailed walkthrough of the code and ideas behind it.
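The PCA visualizations above can be sketched roughly as follows. This is a minimal stand-in, assuming the encoder output has already been extracted as a `(num_patches, dim)` array; the random features, shapes, and the `pca_rgb` helper are illustrative, not the notebook's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(features, mask=None, n_components=3):
    """Project patch features onto 3 PCA components and rescale to [0, 1] RGB.

    features: (num_patches, dim) array of encoder outputs.
    mask: optional boolean array (num_patches,); PCA is fit on the
          unmasked patches only, and masked patches are rendered black.
    """
    pca = PCA(n_components=n_components)
    fit_on = features if mask is None else features[mask]
    pca.fit(fit_on)
    proj = pca.transform(features)
    # Min-max normalize each component so it maps onto an RGB channel.
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    if mask is not None:
        proj[~mask] = 0.0
    return proj

# Random stand-in features for a 16x16 patch grid with 1024-dim latents:
feats = np.random.randn(256, 1024)
rgb = pca_rgb(feats)  # (256, 3); reshape to (16, 16, 3) to plot
```

Fitting PCA only on the unmasked patches mimics the masked setting: the projection is then applied to every patch, so you can inspect what the components recover in the masked regions.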
- Compare the transition model's latent-space predictions for V-JEPA2 with those for DINO. Other encoders are skipped for now.
- Add a decoder for visualization purposes.
- Generate a better dataset, option for simple RGB frame environment.
- Balance a pendulum first, then include the actions.
- Add option for actions with MPC and CEM.
Check out the World Model.ipynb notebook to test the model's planning. Note that the continuous action space is considerably harder than the discrete action space used in the original paper.
(WIP) The current MPC-CEM loop is the best-performing one on the pendulum so far, but it does not always succeed.
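The MPC-CEM loop can be summarized with the sketch below. The `dynamics` and `cost` callables stand in for the learned latent predictor and a goal-distance cost; all hyperparameter values here are illustrative, not the ones used in the repo.

```python
import numpy as np

def cem_plan(dynamics, cost, z0, horizon=15, pop=64, elites=8, iters=5,
             act_low=-2.0, act_high=2.0, seed=0):
    """Cross-entropy method over action sequences.

    dynamics(z, a) -> next latent state; cost(z) -> scalar to minimize.
    Returns the first action of the best sequence (MPC-style: execute it,
    observe, then replan from the new state).
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(horizon)
    sigma = np.ones(horizon) * (act_high - act_low) / 2
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        acts = np.clip(rng.normal(mu, sigma, size=(pop, horizon)),
                       act_low, act_high)
        returns = np.zeros(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = dynamics(z, acts[i, t])
                returns[i] += cost(z)
        # Refit the sampling distribution on the lowest-cost sequences.
        elite = acts[np.argsort(returns)[:elites]]
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]

# Toy check on a 1D integrator: the planner should push z toward 0.
first_action = cem_plan(lambda z, a: z + 0.1 * a, lambda z: z * z, z0=1.0)
```

With a learned model, `dynamics` would roll the ViT predictor forward in latent space and `cost` would measure distance to an encoded goal frame.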
Install the packages using the requirements.txt file.

```bash
# using conda
conda create --name jepa python=3.11
conda activate jepa
pip install -r requirements.txt
```
```bash
# Run the training code, adjust the argparse flags
python train_world_model.py
# Test the planning of the model on the pendulum environment
python test_planning.py
```

The model is split into three components: the action embedding network, a ViT latent predictor for future-state prediction, and a decoder for visualizing the latents. After training, these are saved to the output folder so the networks can easily be loaded back in.
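Saving and reloading the three components might look like the sketch below. The tiny `nn.Linear` modules and the `<name>.pt` file layout are stand-ins for the real networks and whatever naming `train_world_model.py` actually uses.

```python
from pathlib import Path

import torch
from torch import nn

# Stand-in components; the real ones are the action embedding network,
# the ViT latent predictor, and the decoder.
components = {
    "action_embed": nn.Linear(1, 64),
    "predictor": nn.Linear(64, 64),
    "decoder": nn.Linear(64, 3 * 16 * 16),
}

def save_components(components, out_dir="output"):
    """Write each component's state dict to out_dir/<name>.pt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, module in components.items():
        torch.save(module.state_dict(), out / f"{name}.pt")

def load_components(components, out_dir="output"):
    """Load weights back into freshly constructed modules."""
    out = Path(out_dir)
    for name, module in components.items():
        module.load_state_dict(torch.load(out / f"{name}.pt"))
    return components
```

Saving state dicts rather than whole modules keeps the files loadable even if the surrounding class definitions move between refactors.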
The V-JEPA2 encoder takes in two frames and merges them into a single step in the output space, since its tubelet size is 2. A number of frames from a kitesurfing video are shown below.
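The tubelet arithmetic works out as follows; the 256x256 input and 16x16 patch size are assumed here (a ViT-L/16-style setup), so adjust for your checkpoint.

```python
def num_output_tokens(num_frames, img_size=256, patch=16, tubelet=2):
    """Tokens the video encoder emits: frames are grouped in pairs
    (tubelet size 2), and each pair yields one (img_size/patch)^2 grid
    of patch tokens."""
    assert num_frames % tubelet == 0, "frame count must be a multiple of the tubelet size"
    per_grid = (img_size // patch) ** 2
    return (num_frames // tubelet) * per_grid

# Two frames collapse into a single 16x16 grid of latent tokens:
num_output_tokens(2)  # -> 256
```

This is why frames are fed to the encoder in pairs: an odd trailing frame has no tubelet partner.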

When passing, for example, frames 5 and 6 through the encoder, we get the following latent features, processed with PCA for visualization purposes. The encoder clearly separates the kites in both frames.

These are the outputs of the future latent state predictor and decoder trained on the Pendulum environment. As you can see, it predicts the first three states quite accurately.

As for the latent-state comparison between the predictor and the encoder: these are also closely comparable.
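One simple way to quantify this comparison is mean per-token cosine similarity between predicted and encoded latents; the helper below is an illustrative sketch, not the repo's metric.

```python
import numpy as np

def latent_similarity(pred, target, eps=1e-8):
    """Mean per-token cosine similarity between predicted and encoded
    latents, both shaped (num_tokens, dim)."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    tgt_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float((pred_n * tgt_n).sum(-1).mean())

# Identical latents score 1.0; unrelated random latents hover near 0.
z = np.random.randn(256, 64)
latent_similarity(z, z)  # -> 1.0 (up to float precision)
```

Tracking this score over the prediction horizon shows where the rollout drifts away from the encoder's latents.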

A short clip where the model does succeed in balancing the pendulum.

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., Arnaud, S., Gejji, A., Martin, A., Hogan, F. R., Dugas, D., Bojanowski, P., Khalidov, V., … Ballas, N. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (No. arXiv:2506.09985). arXiv. https://doi.org/10.48550/arXiv.2506.09985
Kim, I. H., Cho, S., Huang, J., Yi, J., Lee, J.-Y., & Kim, S. (2025). Exploring Temporally-Aware Features for Point Tracking (No. arXiv:2501.12218). arXiv. https://doi.org/10.48550/arXiv.2501.12218
Zhou, G., Pan, H., LeCun, Y., & Pinto, L. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (No. arXiv:2411.04983). arXiv. https://doi.org/10.48550/arXiv.2411.04983