conda create -n showui python=3.10
conda activate showui
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 --user
pip install -r requirements.txt --user
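(Optional) A quick sanity check that the CUDA build of PyTorch is visible inside the new environment:

import torch

print(torch.__version__)          # expect 2.1.2+cu118
print(torch.cuda.is_available())  # should print True on a CUDA machine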
- Download the grounding training datasets -- ShowUI-desktop and ShowUI-web.
- Download AMEX, then use our prepare/hf_amex.py to create the metadata.
- Download the grounding evaluation dataset -- ScreenSpot.
You can use huggingface-cli to download these datasets easily.
cd $_DATA_DIR
huggingface-cli download showlab/ShowUI-desktop --repo-type dataset --local-dir .
huggingface-cli download KevinQHLin/ScreenSpot --repo-type dataset --local-dir .
- Download GUIAct, then use our prepare/hf_guiact.ipynb to create metadata for each split (i.e., web, mobile).
- Set up Mind2Web, AITW, and MiniWob following SeeClick's instructions. Then use our prepare/hf_mind2web/aitw/miniwob.py scripts to process them and get the metadata.
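If you prefer scripting the downloads, the Hugging Face-hosted datasets can also be fetched with the huggingface_hub Python API; a minimal sketch using only the two repo IDs listed above (the remaining datasets need their own repo IDs):

from huggingface_hub import snapshot_download

data_dir = "."  # your $_DATA_DIR
for repo_id in ["showlab/ShowUI-desktop", "KevinQHLin/ScreenSpot"]:
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=data_dir)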
Then, the datasets should be organized as follows:
$_DATA_DIR
- ScreenSpot
- images
- metadata
- AMEX
- images
- metadata
- ShowUI-web
- images
- metadata
- ShowUI-desktop
- images
- metadata
- GUI_Course
- GUIAct
- images
- metadata
- Mind2Web
- images
- metadata
- AITW
- images
- metadata
- MiniWob
- images
- metadata
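A quick way to sanity-check that layout before training; this minimal sketch only verifies that each dataset folder contains images/ and metadata/ (it assumes _DATA_DIR is exported in the environment):

import os

data_dir = os.environ.get("_DATA_DIR", ".")
datasets = ["ScreenSpot", "AMEX", "ShowUI-web", "ShowUI-desktop",
            "GUI_Course/GUIAct", "Mind2Web", "AITW", "MiniWob"]
for name in datasets:
    for sub in ("images", "metadata"):
        path = os.path.join(data_dir, name, sub)
        print("ok" if os.path.isdir(path) else "MISSING", path)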
You can simply reuse the existing implementation of dset_shared_grounding.py for UI grounding, or dset_shared_navigation.py for UI navigation.
For grounding, you just need to define the dataset_mapping for path identification, e.g., "showui": "hf_train.json" (see the sketch below).
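As a rough illustration (the exact variable name and where it lives inside dset_shared_grounding.py may differ), such a mapping pairs the dataset name used on the command line with the metadata JSON file to load:

# Hypothetical grounding dataset_mapping: keys match --train_dataset / --val_dataset,
# values are metadata JSON files under <dataset>/metadata/.
dataset_mapping = {
    "showui-desktop": "hf_train.json",
    "showui-web": "hf_train.json",
    "amex": "hf_train.json",
    "screenspot": "hf_test.json",
}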
Please organize the UI grounding metadata as follows:
"""
sample = {
"img_url": "c12b572ebccfae5052fe62826615c58d.png",
"img_size": [
1920,
1080
],
"element": [
{
"instruction": "Galerie",
"bbox": [
0.6125,
0.35648148148148145,
0.6817708333333333,
0.375
],
"data_type": "text",
"point": [
0.65,
0.37
]
},
{
"instruction": "Coiffure",
"bbox": [
0.30416666666666664,
0.35648148148148145,
0.3770833333333333,
0.375
],
"data_type": "text",
"point": [
0.34,
0.37
]
}],
"element_size": 2
}
"""
For navigation, you need to define the dataset_mapping as above.
Besides, you need to define the action space in template/shared_navigation.py for your customized scenario; a rough sketch is given below.
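The exact structure expected by template/shared_navigation.py is codebase-specific, so treat the following only as a hypothetical illustration of what a custom action space could look like (the action names and wording here are ours):

# Hypothetical action space for a custom navigation scenario; the real
# definitions in template/shared_navigation.py may use a different structure.
ACTION_SPACE_CUSTOM = {
    "CLICK": "click at a normalized (x, y) position on the screen",
    "INPUT": "type the given text into the focused element",
    "SCROLL": "scroll the page in the given direction",
    "STOP": "finish the task",
}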
Below are the instructions for training on grounding, followed by evaluation on ScreenSpot grounding.
Please keep the batch size (bsz) at 1; to enlarge the effective batch size, increase grad_accumulation_steps instead (effective batch size = batch_size x grad_accumulation_steps x number of GPUs).
Our codebase uses wandb to monitor the training process; please provide your own wandb API key via $WANDB_KEY.
deepspeed --include localhost:1 --master_port 5678 train.py \
--wandb_key=$WANDB_KEY \
--model_id='showlab/ShowUI-2B' \
--version='showlab/ShowUI-2B' \
--dataset_dir=$_DATA_DIR \
--log_base_dir=$_SAVE_DIR \
--epochs=50 \
--steps_per_epoch=100 \
--batch_size=1 \
--grad_accumulation_steps=2 \
--model_max_length=8192 \
--exp_id="debug" \
--train_ratio="1" \
--train_dataset="showui-desktop" \
--train_json="hf_train" \
--val_dataset="screenspot" \
--precision="bf16" \
--attn_imple="sdpa" \
--workers=0 \
--lora_r=32 \
--lora_alpha=64 \
--min_visual_tokens=256 \
--max_visual_tokens=1344 \
--num_turn=100 \
--crop_min=0.5 \
--crop_max=1.5 \
--random_sample \
--record_sample \
--lr=0.0001 \
--uniform_prompt \
--ds_zero="zero2" \
--gradient_checkpointing \
--lm_skip_ratio=0.5 \
--lm_skip_layer='[1,28,0]'
Then, the model checkpoints will be saved under $_SAVE_DIR/$exp_id.
We have provided an evaluation script for ScreenSpot in main/eval_screenspot.py.
If you want to evaluate on your own setting, you need to define the evaluation function and place it under main/eval_X.py.
You should be able to monitor the training information in the wandb panel.
Note: for evaluation only, please pass --eval_only and set --lora_r=0; otherwise, the LoRA adapters will change the model behavior.
The command below uses GUIAct to pre-train a Qwen2-VL model, followed by evaluation on AITW.
We set num_history to 2 with interleaved_history='tttt'.
If you have access to more GPU memory, feel free to switch to 'vtvt' and increase the history length.
deepspeed --include localhost:1 --master_port 5678 train.py \
--wandb_key=$WANDB_KEY \
--model_id='Qwen/Qwen2-VL-2B-Instruct' \
--version='Qwen/Qwen2-VL-2B-Instruct' \
--dataset_dir=$_DATA_DIR \
--log_base_dir=$_SAVE_DIR \
--epochs=50 \
--steps_per_epoch=100 \
--batch_size=1 \
--grad_accumulation_steps=2 \
--model_max_length=8192 \
--exp_id="debug" \
--train_ratio="1,1,1" \
--train_dataset="guiact,guiact,guiact" \
--train_json="hf_train_smartphone,hf_train_web-multi,hf_train_web-single" \
--val_dataset="aitw" \
--val_json="hf_test" \
--precision="bf16" \
--attn_imple="sdpa" \
--workers=0 \
--lora_r=32 \
--lora_alpha=64 \
--min_visual_tokens=256 \
--max_visual_tokens=1344 \
--num_turn=100 \
--random_sample \
--record_sample \
--lr=0.0001 \
--uniform_prompt \
--ds_zero="zero2" \
--gradient_checkpointing \
--lm_skip_ratio=0.5 \
--lm_skip_layer='[1,28,0]' \
--num_history=2 \
--interleaved_history='tttt'
For Mind2Web and AITW zero-shot evaluation, we may encounter action mismatches (e.g., TYPE, INPUT), leading to unstable scores. To mitigate this, we pretrain on navigation data and monitor intermediate scores throughout pretraining, reporting the best-performing result.
The command below uses downstream training data to fine-tune our ShowUI.
To ensure better performance, we enlarge min_visual_tokens to 1344 and max_visual_tokens to 1680 during the fine-tuning stage.
You can easily switch the training dataset train_dataset / validation dataset val_dataset to aitw or mind2web, and replace train_json or val_json if needed:
- For Mind2Web, train_dataset=mind2web, train_json='hf_train' and val_json='hf_test_full'.
- For AITW, train_dataset=aitw, train_json='hf_train' and val_json='hf_test'.
- For MiniWob, train_dataset=miniwob and train_json='hf_miniwob'. Follow SeeClick's instructions to set up the evaluation environment.
deepspeed --include localhost:1 --master_port 5678 train.py \
--wandb_key=$WANDB_KEY \
--model_id='showlab/ShowUI-2B' \
--version='showlab/ShowUI-2B' \
--dataset_dir=$_DATA_DIR \
--log_base_dir=$_SAVE_DIR \
--epochs=50 \
--steps_per_epoch=100 \
--batch_size=1 \
--grad_accumulation_steps=2 \
--model_max_length=8192 \
--exp_id="debug" \
--train_ratio="1" \
--train_dataset="aitw" \
--train_json="hf_train" \
--val_dataset="aitw" \
--val_json="hf_test" \
--precision="bf16" \
--attn_imple="sdpa" \
--workers=0 \
--lora_r=32 \
--lora_alpha=64 \
--min_visual_tokens=1344 \
--max_visual_tokens=1680 \
--num_turn=100 \
--random_sample \
--record_sample \
--lr=0.0001 \
--uniform_prompt \
--ds_zero="zero2" \
--gradient_checkpointing \
--lm_skip_ratio=0.5 \
--lm_skip_layer='[1,28,0]' \
--num_history=4 \
--interleaved_history='tttt'
Note: for evaluation only, please pass --eval_only and set --lora_r=0; otherwise, the LoRA adapters will change the model behavior.
Below are the instructions for co-training on both grounding and navigation data. Training on multiple nodes (e.g., 32 GPUs) is recommended.
You can easily add or remove any training dataset in train_dataset and adjust the corresponding train_ratio.
deepspeed --include localhost:1 --master_port 5678 train.py \
--wandb_key=$WANDB_KEY \
--model_id='Qwen/Qwen2-VL-2B-Instruct' \
--version='Qwen/Qwen2-VL-2B-Instruct' \
--dataset_dir=$_DATA_DIR \
--log_base_dir=$_SAVE_DIR \
--epochs=50 \
--steps_per_epoch=100 \
--batch_size=1 \
--grad_accumulation_steps=2 \
--model_max_length=8192 \
--exp_id="debug" \
--train_ratio="1,1,1,1,1,1" \
--train_dataset="showui-desktop, showui-web, amex, guiact, guiact, guiact," \
--train_json="hf_train, hf_train, hf_train, hf_train_smartphone, hf_train_web-multi, hf_train_web-single" \
--val_dataset="screenspot" \
--val_json="hf_test" \
--precision="bf16" \
--attn_imple="sdpa" \
--workers=0 \
--lora_r=32 \
--lora_alpha=64 \
--min_visual_tokens=256 \
--max_visual_tokens=1344 \
--num_turn=100 \
--random_sample \
--record_sample \
--lr=0.0001 \
--uniform_prompt \
--ds_zero="zero2" \
--gradient_checkpointing \
--lm_skip_ratio=0.5 \
--lm_skip_layer='[1,28,0]' \
--num_history=2 \
--interleaved_history='tttt'
Once you have finished training, you can use the following commands to convert and merge the model checkpoint.
exp_dir="$_SAVE_DIR/$exp_id/2024-11-28_17-30-32/"   # replace with your run's timestamped directory
showui_dir=$(pwd)
ckpt_dir="${exp_dir}/ckpt_model/"
merge_dir="${ckpt_dir}/merged_model"

# Convert the DeepSpeed ZeRO shards into a single pytorch_model.bin
cd "$ckpt_dir" || { echo "Failed to cd to $ckpt_dir"; exit 1; }
python zero_to_fp32.py . pytorch_model.bin

# Merge the trained weights into a standalone checkpoint under merged_model/
mkdir -p merged_model
cd "$showui_dir"
python3 merge_weight.py --exp_dir="$exp_dir"
echo "$merge_dir"