
Commit 5e15f96

revise offline_tsds,near and remove origin tsds (#14)
1 parent b0e933b commit 5e15f96

File tree

6 files changed: +516 −132 lines changed


docs/.vuepress/notes/en/guide.ts

Lines changed: 2 additions & 1 deletion

@@ -25,7 +25,8 @@ export const Guide: ThemeNote = defineNoteConfig({
      'quickstart',
      'tutorial',
      'selector_less',
-     'selector_tsds',
+     'selector_offline_tsds',
+     'selector_offline_near',
      'selector_zeroth'
    ],
  },

docs/.vuepress/notes/zh/guide.ts

Lines changed: 2 additions & 1 deletion

@@ -25,7 +25,8 @@ export const Guide: ThemeNote = defineNoteConfig({
      'quickstart',
      'tutorial',
      'selector_less',
-     'selector_tsds',
+     'selector_offline_tsds',
+     'selector_offline_near',
      'selector_zeroth',
    ],
  },
Lines changed: 201 additions & 0 deletions

@@ -0,0 +1,201 @@

---
title: Offline-Near-Selector
createTime: 2025/11/27 16:02:41
permalink: /en/guide/7k0w3d92/
icon: flowbite:fish-alt-outline
---

# Offline NEAR Selector

This document introduces how to use the **Offline NEAR Selector** for **dynamic data selection** during supervised fine-tuning (SFT) within the **DataFlex** framework, selecting the candidate data closest to the target dataset to improve generalization performance.

---

## 1. Method Overview

The core idea of **NEAR** is:

* Further encode the **already tokenized** samples into **sentence embeddings** (e.g., 512-dim).
* Perform a **nearest-neighbor search** in the embedding space to obtain each candidate sample's representativeness score.

> Intuition: **select the candidate data closest to the target dataset.**

### Scoring Formulation

Let the sentence embedding of a sample be $e_i$, and let its $max_K$ nearest neighbors be $\mathcal{N}_K(i)$.
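
The exact score is not spelled out above; one plausible instantiation, stated here as an assumption rather than the repository's definition, scores each sample by the mean cosine similarity to its $max_K$ nearest neighbors in the query (target) set:

$$
s_i \;=\; \frac{1}{K} \sum_{j \in \mathcal{N}_K(i)} \cos\big(e_i,\, e_j\big)
$$

Under this reading, a higher $s_i$ means the candidate lies closer to the target dataset and is more likely to be selected.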

---

## 2. Environment & Dependencies

```bash
# DataFlex (recommended: editable install)
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .

# Common training/inference dependencies (as needed)
pip install llamafactory

# NEAR extras (vector search & embedding inference)
pip install faiss-cpu vllm sentence-transformers
```

---

## 3. Offline Selection

Modify the training set, embedding model, and parameters inside
**DataFlex/src/dataflex/offline_selector/offline_near_selector.py**:

```python
if __name__ == "__main__":
    near = offline_near_Selector(
        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w",  # split = train
        query_path="OpenDCAI/DataFlex-selector-openhermes-10w",      # split = validation

        # To use vLLM, prefix the model name with "vllm:";
        # otherwise sentence-transformers is used automatically.
        embed_model="vllm:Qwen/Qwen3-Embedding-0.6B",
        batch_size=32,
        save_indices_path="top_indices.npy",
        max_K=1000,
    )
    near.selector()
```

Note: `embed_model` is used to encode the already-tokenized text into sentence embeddings (e.g., 512-dim); both vLLM and sentence-transformers inference are supported.

Output: an indices matrix saved to `save_indices_path`, containing the indices of the `max_K` closest candidates for each query sample.
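
To make the shape of that output concrete, here is a minimal brute-force sketch of the retrieval step on hypothetical toy data, using plain NumPy in place of FAISS/vLLM (an illustration of the technique, not the selector's actual code):

```python
import numpy as np

def top_k_indices(queries: np.ndarray, candidates: np.ndarray, max_K: int) -> np.ndarray:
    """For each query embedding, return the indices of the max_K most
    cosine-similar candidate embeddings (shape: [n_queries, max_K])."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T                     # cosine similarity matrix
    order = np.argsort(-sims, axis=1)  # candidates sorted by descending similarity
    return order[:, :max_K]

# Toy data: 3 candidates and 2 queries with 4-dim embeddings.
candidates = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0, 0.0]])
queries = np.array([[1.0, 0.05, 0.0, 0.0],
                    [0.0, 1.0, 0.1, 0.0]])
top = top_k_indices(queries, candidates, max_K=2)
np.save("top_indices.npy", top)        # same artifact name as save_indices_path above
print(top.shape)                       # (2, 2): one row of indices per query
```

At real scale this dense matrix multiply is what FAISS accelerates; the saved `.npy` keeps only the indices, not the similarities.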

---

## 4. Key Hyperparameters & Tips

| Parameter | Typical Range | Meaning & Tips |
| ------------- | ------------- | -------------- |
| `max_K` | 64–10000 | Upper bound of nearest-neighbor retrieval. Larger is more stable but more costly; balance against data size and VRAM. |
| `model_name` | — | Path/name of the sentence encoder (local BERT/USE/SimCSE, etc.). |
| `cache_dir` | — | Cache directory for intermediate artifacts and resume-from-cache. |

---

## 5. Component Config (`components.yaml`)

**Path:** `DataFlex/src/dataflex/configs/components.yaml`

**Preset example**

```yaml
near:
  name: near
  params:
    indices_path: ./src/dataflex/offline_selector/top_indices.npy
    cache_dir: ../dataflex_saves/near_output
```
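
At training time, the `near` component only needs the precomputed matrix at `indices_path`. A minimal sketch of how such a matrix could be turned into a deduplicated training subset (an assumption about the mechanics for illustration, not the component's actual code):

```python
import numpy as np

# Hypothetical stand-in for the matrix written by the offline selector:
# one row of max_K nearest-candidate indices per query sample.
top_indices = np.array([[4, 1, 7],
                        [1, 2, 4],
                        [9, 4, 0]])
np.save("top_indices.npy", top_indices)

# A consumer of indices_path might flatten all rows and deduplicate,
# preserving first-seen order, to get the candidate subset to train on.
loaded = np.load("top_indices.npy")
flat = loaded.ravel()
_, first_pos = np.unique(flat, return_index=True)
subset = flat[np.sort(first_pos)]
print(subset.tolist())   # [4, 1, 7, 2, 9, 0]
```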

---

## 6. Dynamic Training Config (LoRA + NEAR)

**Example file:** `DataFlex/examples/train_lora/selectors/near.yaml`

```yaml
### model
model_name_or_path:
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 8

### dataset
dataset: # training dataset
template: qwen
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ../dataflex_saves
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### Dataflex args
train_type: dynamic_select
components_cfg_file: src/dataflex/configs/components.yaml
component_name: near
warmup_step: 400
update_step: 500
update_times: 2
```

**Notes:**

* `component_name: near` enables the NEAR component.
* `warmup_step / update_step / update_times` decide **when** and **how often** to re-select the training subset; total steps ≈ `warmup_step + update_step × update_times`.
* Total batch size = number of devices × `per_device_train_batch_size` × `gradient_accumulation_steps`.
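
The two formulas in the notes can be checked with quick arithmetic; for the values in this config, and assuming an 8-GPU node (the device count is an assumption for illustration):

```python
# Dataflex scheduling: total steps = warmup + update_step * update_times.
warmup_step, update_step, update_times = 400, 500, 2
total_steps = warmup_step + update_step * update_times
print(total_steps)        # 1400

# Effective batch size = devices * per-device batch * grad accumulation.
device_number = 8         # assumption: one 8-GPU node
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
total_batch_size = device_number * per_device_train_batch_size * gradient_accumulation_steps
print(total_batch_size)   # 256
```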

---

## 7. Run Training

```bash
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 dataflex-cli train examples/train_lora/selectors/near.yaml
```

**Note:** the above example runs with a distributed launch.

During training, NEAR is triggered at the scheduled steps: it reads the saved sample indices and selects the next training subset.
---

## 8. Merge & Export the Model

Same as the Less Selector pipeline.

**Config file:** `DataFlex/examples/merge_lora/llama3_lora_sft.yaml`

```yaml
model_name_or_path: base model path
adapter_name_or_path: finetuned adapter path
template: qwen
trust_remote_code: true

export_dir: ../dataflex_saves
export_size: 5
export_device: cpu
export_legacy_format: false
```

Run the export command (inside the LLaMA-Factory directory):

```bash
llamafactory-cli export llama3_lora_sft.yaml
```

---

## 9. Evaluation & Comparison

We recommend using the [DataFlow](https://github.com/OpenDCAI/DataFlow) QA evaluation pipeline to compare **NEAR** against **Less** and **random sampling**.

![result](https://github.com/user-attachments/assets/01b34b09-1b0c-46b0-ad65-21cd5c3e1e9a)