Commit b0e933b

update doremi mixer docs (#12)
1 parent: 5d78771

2 files changed (30 additions & 50 deletions)

docs/en/notes/guide/mixer/doremi.md (15 additions & 25 deletions)
@@ -1,6 +1,6 @@
 ---
 title: DoReMi Data Mixer
-createTime: 2025/01/30 10:00:00
+createTime: 2025/11/27 10:00:00
 icon: material-symbols:balance
 permalink: /en/guide/mixer/doremi/
 ---
@@ -33,9 +33,6 @@ component_name: static # Use static mixer
 mixture_sample_rule: mixture
 init_mixture_proportions: [0.5, 0.5] # Initial weights, uniform distribution
 static_mix: true
-warmup_step: 100
-update_step: 200
-update_times: 3
 ```
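The `mixture_sample_rule: mixture` setting above draws each training example from a domain chosen according to `init_mixture_proportions`. A minimal sketch of that sampling rule, with toy data and a hypothetical `sample_batch` helper (not the DataFlex implementation):

```python
import random

# Two toy domains; in practice these would be dataset shards.
domains = {"wiki": ["w0", "w1", "w2"], "c4": ["c0", "c1", "c2"]}
proportions = [0.5, 0.5]  # mirrors init_mixture_proportions

def sample_batch(domains, proportions, batch_size, seed=0):
    """Draw a batch: sample a domain per example, then an example from it."""
    rng = random.Random(seed)
    names = list(domains)
    return [rng.choice(domains[rng.choices(names, weights=proportions, k=1)[0]])
            for _ in range(batch_size)]

batch = sample_batch(domains, proportions, batch_size=8)
```

With uniform proportions each domain contributes roughly half of the batch in expectation; Step 3 below reuses the same rule with the optimized proportions.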
 
 **Key Parameters**:
@@ -56,7 +53,7 @@ mixers:
 
 ### Step 2: Proxy Model Weight Optimization
 
-Use the DoReMi algorithm to dynamically optimize domain weights on a small proxy model. The algorithm adjusts weights by computing excess loss for each domain.
+Use the DoReMi algorithm to dynamically optimize domain weights on a small proxy model. The algorithm adjusts weights by computing excess loss for each domain. During training, the algorithm uses uniform sampling for data selection, but the optimized domain weights are recorded and used for loss reweighting in the training step.
 
 **Configuration File**: `doremi_step2_dynamic_qwen_pt_full.yaml`
 
@@ -82,30 +79,30 @@ mixers:
 # Reference model path from Step 1
 reference_model_path: /path/to/doremi_step1_result/checkpoint-xxx
 # Weight update learning rate (eta in DoReMi paper)
-reweight_eta: 1.0
+reweight_eta: 0.1
 # Weight smoothing parameter (epsilon in DoReMi paper)
-reweight_eps: 1e-3
-# Number of samples to evaluate per domain
-num_eval_samples: 1000
-# Batch size for evaluation
-eval_batch_size: 8
+reweight_eps: 0.01
 ```
 
 **Key Parameters**:
 - `reference_model_path`: Path to the reference model checkpoint from Step 1
 - `reweight_eta`: Learning rate for weight updates, controls adjustment magnitude
 - `reweight_eps`: Smoothing parameter to prevent domain weights from becoming too small
-- `num_eval_samples`: Number of samples per domain for computing excess loss
 - `warmup_step`: Number of warmup training steps before starting weight optimization
 - `update_step`: Frequency of weight updates (every N steps)
 
+**Algorithm Behavior**:
+- The algorithm uses **uniform sampling** for data selection (each domain has equal probability)
+- The optimized `domain_weights` are computed and used for **loss reweighting** during training
+- This approach ensures fair sampling while allowing the loss function to focus on harder domains
+
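The update behind `reweight_eta` and `reweight_eps` follows the DoReMi paper's exponentiated-gradient step: weights are scaled multiplicatively by each domain's excess loss, renormalized, then smoothed toward the uniform distribution so no weight collapses to zero. A minimal sketch of one update, under those assumptions (not the DataFlex implementation):

```python
import math

def doremi_update(weights, excess_losses, eta=0.1, eps=0.01):
    """One DoReMi domain-weight update (exponentiated gradient + smoothing)."""
    # Multiplicative update: domains with larger excess loss gain weight.
    unnorm = [w * math.exp(eta * l) for w, l in zip(weights, excess_losses)]
    total = sum(unnorm)
    # Renormalize, then mix with the uniform distribution (epsilon smoothing).
    k = len(weights)
    return [(1 - eps) * u / total + eps / k for u in unnorm]

# Domain 1 has the larger excess loss, so its weight increases.
new_weights = doremi_update([0.5, 0.5], excess_losses=[0.2, 0.7])
```

Larger `eta` makes the weights react faster to excess-loss differences; `eps` bounds every weight below by roughly `eps / num_domains`.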
 **Weight Logging**:
 
 During training, a `doremi_weights.jsonl` file is automatically generated, recording detailed information for each weight update:
 
 ```json
-{"step": 100, "timestamp": "2025-01-30 10:00:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.3, 0.7], "perdomain_scores": [2.5, 3.2], "reweight_eta": 1.0, "reweight_eps": 0.001}
-{"step": 300, "timestamp": "2025-01-30 10:10:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.25, 0.75], "perdomain_scores": [2.3, 3.5], "reweight_eta": 1.0, "reweight_eps": 0.001}
+{"step": 100, "timestamp": "2025-11-27 10:00:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.3, 0.7], "perdomain_scores": [2.5, 3.2]}
+{"step": 300, "timestamp": "2025-11-27 10:10:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.25, 0.75], "perdomain_scores": [2.3, 3.5]}
 ```
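Because the log is one JSON object per line, it can be post-processed with only the standard library, e.g. to pull out the final weights for Step 3. The records below inline the sample entries shown above for illustration; with a real run you would iterate over `open("doremi_weights.jsonl")` instead:

```python
import json

# Sample records in the doremi_weights.jsonl format shown above.
log_lines = [
    '{"step": 100, "domain_names": ["wiki", "c4"], "domain_weights": [0.3, 0.7]}',
    '{"step": 300, "domain_names": ["wiki", "c4"], "domain_weights": [0.25, 0.75]}',
]
records = [json.loads(line) for line in log_lines]
# The last record holds the final optimized weights, ready for Step 3.
final = dict(zip(records[-1]["domain_names"], records[-1]["domain_weights"]))
```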
 
 ### Step 3: Target Model Training
@@ -122,9 +119,6 @@ component_name: static # Use static mixer
 mixture_sample_rule: mixture
 init_mixture_proportions: [0.3, 0.7] # Use optimized weights from Step 2
 static_mix: true
-warmup_step: 100
-update_step: 200
-update_times: 3
 ```
 
 **Key Steps**:
@@ -198,9 +192,10 @@ plt.show()
 ### 2. Weight Optimization
 
 - Recommend using small proxy models (e.g., 0.5B-1B parameters) to reduce computational cost
-- Set `num_eval_samples` between 1000-5000 to balance evaluation accuracy and speed
-- `reweight_eta` is typically set to 1.0, adjust based on convergence
-- Recommend at least 3-5 weight updates (`update_times`) to observe convergence trends
+- `reweight_eta` can be adjusted based on convergence (higher values lead to faster weight changes)
+- `reweight_eps` controls the minimum weight for each domain
+- Recommend observing convergence trends to set an appropriate number of weight updates (`update_times`)
+- The algorithm uses uniform sampling but applies domain weights to loss reweighting
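The last point, uniform sampling with the weights applied to the loss, can be sketched as follows; the `reweighted_loss` helper, per-example domain labels, and scaling by the number of domains are illustrative assumptions, not the DataFlex API:

```python
def reweighted_loss(per_example_losses, example_domains, domain_weights):
    """Mean loss with each example scaled by its domain's weight.

    Scaling by the number of domains makes uniform weights reduce to the
    plain mean, so reweighting only shifts emphasis between domains.
    """
    k = len(domain_weights)
    weighted = [k * domain_weights[d] * loss
                for loss, d in zip(per_example_losses, example_domains)]
    return sum(weighted) / len(weighted)

# Uniform weights recover the plain mean of [2.0, 3.0].
uniform = reweighted_loss([2.0, 3.0], ["wiki", "c4"], {"wiki": 0.5, "c4": 0.5})
# Skewed weights emphasize the harder (higher-loss) domain.
skewed = reweighted_loss([2.0, 3.0], ["wiki", "c4"], {"wiki": 0.25, "c4": 0.75})
```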

 ### 3. Target Model Training
 
@@ -238,8 +233,3 @@ A: Yes. If `reference_model_path` is set to `null`, the algorithm will directly
 - Paper: [DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining](https://arxiv.org/abs/2305.10429)
 - Project: [DataFlex GitHub](https://github.com/OpenDCAI/DataFlex)
 
-## Related Components
-
-- [Static Mixer](/en/guide/mixer/static/)
-- [Mixture Manager](/en/guide/data/mixture/)
-
docs/zh/notes/guide/mixer/doremi.md (15 additions & 25 deletions)
@@ -1,6 +1,6 @@
 ---
 title: DoReMi Data Mixer
-createTime: 2025/01/30 10:00:00
+createTime: 2025/11/27 10:00:00
 icon: material-symbols:balance
 permalink: /zh/guide/mixer/doremi/
 ---
@@ -33,9 +33,6 @@ component_name: static # Use static mixer
 mixture_sample_rule: mixture
 init_mixture_proportions: [0.5, 0.5] # Initial weights; uniform distribution here
 static_mix: true
-warmup_step: 100
-update_step: 200
-update_times: 3
 ```
 
 **Key Parameters**:
@@ -56,7 +53,7 @@ mixers:
 
 ### Step 2: Proxy Model Weight Optimization
 
-Use the DoReMi algorithm to dynamically optimize domain weights on a small proxy model. The algorithm adjusts weights by computing the excess loss of each domain.
+Use the DoReMi algorithm to dynamically optimize domain weights on a small proxy model. The algorithm adjusts weights by computing the excess loss of each domain. During training, the algorithm uses uniform sampling for data selection, but the optimized domain weights are recorded and used for loss reweighting in the training step.
 
 **Configuration File**: `doremi_step2_dynamic_qwen_pt_full.yaml`
 
@@ -82,30 +79,30 @@ mixers:
 # Path to the reference model trained in Step 1
 reference_model_path: /path/to/doremi_step1_result/checkpoint-xxx
 # Weight update learning rate (eta in the DoReMi paper)
-reweight_eta: 1.0
+reweight_eta: 0.1
 # Weight smoothing parameter (epsilon in the DoReMi paper)
-reweight_eps: 1e-3
-# Number of evaluation samples per domain
-num_eval_samples: 1000
-# Evaluation batch size
-eval_batch_size: 8
+reweight_eps: 0.01
 ```
 
 **Key Parameters**:
 - `reference_model_path`: Path to the reference model checkpoint from Step 1
 - `reweight_eta`: Learning rate for weight updates; controls the adjustment magnitude
 - `reweight_eps`: Smoothing parameter that prevents some domain weights from becoming too small
-- `num_eval_samples`: Number of samples per domain for computing excess loss
 - `warmup_step`: Number of warmup training steps before weight optimization starts
 - `update_step`: Number of steps between domain weight updates
 
+**Algorithm Behavior**:
+- The algorithm uses **uniform sampling** for data selection (each domain has an equal sampling probability)
+- The optimized `domain_weights` are computed and used for **loss reweighting** during training
+- This approach ensures fair sampling while allowing the loss function to focus on harder domains
+
 **Weight Logging**:
 
 During training, a `doremi_weights.jsonl` file is generated automatically, recording the details of each weight update:
 
 ```json
-{"step": 100, "timestamp": "2025-01-30 10:00:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.3, 0.7], "perdomain_scores": [2.5, 3.2], "reweight_eta": 1.0, "reweight_eps": 0.001}
-{"step": 300, "timestamp": "2025-01-30 10:10:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.25, 0.75], "perdomain_scores": [2.3, 3.5], "reweight_eta": 1.0, "reweight_eps": 0.001}
+{"step": 100, "timestamp": "2025-11-27 10:00:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.3, 0.7], "perdomain_scores": [2.5, 3.2]}
+{"step": 300, "timestamp": "2025-11-27 10:10:00", "domain_names": ["wiki", "c4"], "domain_weights": [0.25, 0.75], "perdomain_scores": [2.3, 3.5]}
 ```
 
 ### Step 3: Target Model Training
@@ -122,9 +119,6 @@ component_name: static # Use static mixer
 mixture_sample_rule: mixture
 init_mixture_proportions: [0.3, 0.7] # Use the final weights optimized in Step 2
 static_mix: true
-warmup_step: 100
-update_step: 200
-update_times: 3
 ```
 
 **Key Steps**:
@@ -198,9 +192,10 @@ plt.show()
 ### 2. Weight Optimization
 
 - Recommend using small proxy models (e.g., 0.5B-1B parameters) to reduce computational cost
-- Set `num_eval_samples` between 1000 and 5000 to balance evaluation accuracy and speed
-- `reweight_eta` is typically set to 1.0; adjust it based on convergence
-- Recommend at least 3-5 weight updates (`update_times`) to observe convergence trends
+- `reweight_eta` can be adjusted based on convergence (larger values change the weights faster)
+- `reweight_eps` controls the minimum weight of each domain
+- Recommend observing convergence trends to set an appropriate number of weight updates (`update_times`)
+- The algorithm uses uniform sampling but applies the domain weights to loss reweighting
 
 ### 3. Target Model Training
 
@@ -238,8 +233,3 @@ A: Yes. If `reference_model_path` is set to `null`, the algorithm will directly use
 - Paper: [DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining](https://arxiv.org/abs/2305.10429)
 - Project: [DataFlex GitHub](https://github.com/OpenDCAI/DataFlex)
 
-## Related Components
-
-- [Static Mixer](/zh/guide/mixer/static/)
-- [Mixture Manager](/zh/guide/data/mixture/)
-
