To start training, run the embodiment script with your configuration file:
```bash
bash policy_and_value/policy_online/examples/embodiment/run_embodiment.sh YOUR_CONFIG_NAME
# Example:
bash policy_and_value/policy_online/examples/embodiment/run_embodiment.sh rl_release
```
Note: the initial run may take a while (~10 minutes), since loading the dynamics and reward models and letting torch compile build the acceleration graph all add to startup latency. For debugging, you can set actor.model.openpi.use_torch_compile = False (see the config sketch below).
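If you prefer to disable it in the config file rather than at the call site, a minimal sketch follows; the nested layout is an assumption inferred from the dotted path actor.model.openpi.use_torch_compile and may differ from your actual config structure.

```yaml
# Sketch only: nesting assumed from the dotted path
# actor.model.openpi.use_torch_compile.
actor:
  model:
    openpi:
      use_torch_compile: False  # skip torch.compile graph building for faster debug startup
```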
You can flexibly configure the GPU allocation for the env, rollout, and actor components in your YAML config. Here are three common deployment strategies:
- Partial Sharing (Default): Components share some GPUs while keeping others dedicated.
```yaml
cluster:
  num_nodes: 1
  component_placement:
    env: 0-3
    rollout: 4-7
    actor: 0-7
```
- Complete Sharing: All components share all available GPUs.
```yaml
cluster:
  num_nodes: 1
  component_placement:
    env,rollout,actor: all
```
- Complete Separation: Each component uses its own GPUs without interference, eliminating the need for offload functionality.
```yaml
cluster:
  num_nodes: 1
  component_placement:
    env: 0-1
    rollout: 2-5
    actor: 6-7
```
For N-node training, change cluster.num_nodes to N and assign the component_placement accordingly (e.g., if N=2 and each node has 8 GPUs, the placement indices range from 0 to 15); an illustrative two-node layout is sketched below.
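The following sketch is an arbitrary example of a 2-node split, not a recommended configuration; adjust the ranges to your workload.

```yaml
# Illustrative 2-node placement (8 GPUs per node): indices 0-7 live on
# node 0 and 8-15 on node 1. The env/rollout/actor split is an example only.
cluster:
  num_nodes: 2
  component_placement:
    env: 0-3        # node 0
    rollout: 4-7    # node 0
    actor: 8-15     # node 1
```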
Run the multi-task unified training command:
```bash
bash policy_and_value/policy_online/examples/embodiment/run_embodiment_ray_unified_multi_task.sh YOUR_CONFIG_NAME
```
To resume training, modify runner.resume_dir in your config to point to your target checkpoint:
```yaml
runner:
  resume_dir: logs/20251221-00:15:14/${runner.logger.experiment_name}/checkpoints/global_step_13000
```

The key configuration parameters are described below:

| Parameter | Description |
|---|---|
| algorithm.num_group_envs | Number of parallel environments for rollout (e.g., if set to 32 with 4 GPUs allocated to rollout, each GPU handles 8 envs). |
| algorithm.rollout_epoch | Number of epochs for rollouts. |
| algorithm.policy_config_name | Task-specific configuration. Must strictly match your offline (IL) training setting. |
| rollout.model_dir | Path to the pretrained IL model used for initialization. |
| actor.micro_batch_size | Micro-batch size per GPU. |
| actor.global_batch_size | Global batch size across all GPUs. |
| model.action_dim | Expected action dimension output by the VLA model. |
| rollout_ema_decay | EMA weight-preservation coefficient for each rollout model update. |
| dynamics_model_config | Task-specific configuration for the dynamics model. |
| dynamics_model_image_root | (Optional) Custom path for dynamics model images. |
| dynamics_model_output_path | (Optional) Custom path for dynamics model outputs. |
| reward_model_config | Task-specific configuration for the reward model. |
| reward_model_ckpt | Checkpoint path for the reward model. |
| visualize_wm_pred | Set to True to visualize world model predictions; if True, chunk_reward should also be True. |
| chunk_reward | Set to True to use only the reward of the last predicted frame as the reward for the current action chunk. |
| advantage_scale | Weighting coefficient for the computed advantage. |
Note: For other configurations not listed here, we adopt most settings from RLinf. Please refer to the RLinf Documentation for more details.
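Putting a few of these together, a minimal config excerpt might look like the sketch below. The values are placeholders, only parameters whose dotted paths appear in the table above are nested, and the layout is an assumption rather than a verbatim config.

```yaml
# Sketch only: placeholder values, keyed to the num_group_envs example above
# (32 envs over 4 rollout GPUs -> 8 envs per GPU).
algorithm:
  num_group_envs: 32
  rollout_epoch: 1
  policy_config_name: <YOUR_TASK_CONFIG>    # must match the offline IL setting
rollout:
  model_dir: <PATH_TO_PRETRAINED_IL_MODEL>
actor:
  micro_batch_size: 1
  global_batch_size: 32
model:
  action_dim: 7    # placeholder; set to your VLA model's action dimension
```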
Once you have trained your own VLA model, you need to convert the Distributed Checkpoint (.dcp) to a PyTorch state dict (.pt) before deployment.
Run the converter script:
```bash
python toolkits/ckpt_convertor/convert_dcp_to_state_dict.py \
    --dcp_path <YOUR_DCP_CKPT_DIR> \
    --output_path <YOUR_EXPECTED_PT_CKPT_DIR>
```
After conversion, you can use the generated .pt checkpoints on your deployment machine to infer actions.