---
title: Offline-Near-Selector
createTime: 2025/11/27 16:02:41
permalink: /en/guide/7k0w3d92/
icon: flowbite:fish-alt-outline
---

# Offline NEAR Selector

This document introduces how to use the **Offline NEAR Selector** for **dynamic data selection** during supervised fine-tuning (SFT) within the **DataFlex** framework: it finds the candidate data closest to a target dataset in order to improve generalization performance.

---

## 1. Method Overview

The core idea of **NEAR** is:

* Encode the **already tokenized** samples into **sentence embeddings** (e.g., 512‑dim).
* Perform a **nearest‑neighbor search** in the embedding space to obtain each sample's representativeness score.

> Intuition: select the candidate data **closest to the target dataset**.
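
The two bullet points above can be sketched in a few lines. This is an illustrative reconstruction, not DataFlex's actual implementation: it substitutes toy NumPy embeddings for the real sentence encoder and does a brute-force cosine search instead of using a vector index.

```python
import numpy as np

# Illustrative sketch (an assumption, not DataFlex's code): brute-force cosine
# nearest-neighbor search over candidate embeddings for each query embedding.
def nearest_candidates(cand_emb: np.ndarray, query_emb: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k candidates closest (by cosine) to each query."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = q @ c.T                            # (num_queries, num_candidates)
    return np.argsort(-sims, axis=1)[:, :k]   # k most similar per query

cand = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[1.0, 0.1]])
print(nearest_candidates(cand, query, k=2))  # [[0 2]]
```

In the real pipeline the embeddings come from the configured encoder, and at scale the search would use a library such as FAISS rather than a dense matrix product.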

### Scoring Formulation

Let the sentence embedding of a sample be $e_i$, and let its `max_K` nearest neighbors be $\mathcal{N}_K(i)$.
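
The score itself does not appear on this page. One plausible formulation, consistent with the definitions above (an assumption, not the original equation), is the mean cosine similarity of a sample to its `max_K` nearest neighbors:

$$
s_i = \frac{1}{\lvert \mathcal{N}_K(i) \rvert} \sum_{j \in \mathcal{N}_K(i)} \frac{e_i^{\top} e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}
$$

Under this reading, a higher $s_i$ means sample $i$ lies closer to the dense region of embedding space around the target data.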

---

## 2. Environment & Dependencies

```bash
# DataFlex (recommended: editable install)
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .

# Common training/inference dependencies (as needed)
pip install llamafactory

# NEAR extras (vector search & embedding backends)
pip install faiss-cpu vllm sentence-transformers
```

---

## 3. Offline Selection

Modify the training set, embedding model, and parameters inside
**DataFlex/src/dataflex/offline_selector/offline_near_selector.py**:

```python
if __name__ == "__main__":
    near = offline_near_Selector(
        candidate_path="OpenDCAI/DataFlex-selector-openhermes-10w",  # split = train
        query_path="OpenDCAI/DataFlex-selector-openhermes-10w",      # split = validation

        # To use vLLM, prefix the model name with "vllm:";
        # otherwise sentence-transformers is used automatically.
        embed_model="vllm:Qwen/Qwen3-Embedding-0.6B",
        batch_size=32,
        save_indices_path="top_indices.npy",
        max_K=1000,
    )
    near.selector()
```

Note: `embed_model` encodes the already-tokenized text into sentence embeddings (e.g., 512-dim); both vLLM and sentence-transformers inference are supported.

Output: an indices matrix containing the `max_K` closest candidates for each query, saved to `save_indices_path`.

---
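
To sanity-check or post-process the output, the matrix can be flattened into a deduplicated candidate pool. The helper below is hypothetical (DataFlex itself consumes the file via `indices_path` in the component config):

```python
# Hypothetical helper (not part of DataFlex): turn the (num_queries, max_K)
# nearest-neighbor index matrix into a deduplicated pool of candidate indices.
def candidate_pool(top_indices, k):
    """Keep each query's k closest candidates, then take the union."""
    pool = [int(i) for row in top_indices for i in row[:k]]
    return sorted(set(pool))

print(candidate_pool([[3, 1, 2], [1, 4, 0]], k=2))  # [1, 3, 4]
```

With the file produced above, `candidate_pool(numpy.load("top_indices.npy"), 100)` would give the union of each query's 100 closest candidates.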

## 4. Key Hyperparameters & Tips

| Parameter | Typical Range | Meaning & Tips |
| ------------- | ------------- | --------------------------------------------------------------------------------------------- |
| `max_K` | 64–10000 | Upper bound of NN retrieval. Larger = stabler but more costly; balance against data size & VRAM. |
| `embed_model` | — | Path/name of the sentence encoder (local BERT/USE/SimCSE, etc.); prefix with `vllm:` for vLLM inference. |
| `cache_dir` | — | Cache directory for intermediate artifacts and resume‑from‑cache. |

---

## 5. Component Config (`components.yaml`)

**Path:** `DataFlex/src/dataflex/configs/components.yaml`

**Preset example**

```yaml
near:
  name: near
  params:
    indices_path: ./src/dataflex/offline_selector/top_indices.npy
    cache_dir: ../dataflex_saves/near_output
```

---

## 6. Dynamic Training Config (LoRA + NEAR)

**Example file:** `DataFlex/examples/train_lora/selectors/near.yaml`

```yaml
### model
model_name_or_path:
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 16
lora_alpha: 8

### dataset
dataset: # training dataset
template: qwen
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ../dataflex_saves
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### Dataflex args
train_type: dynamic_select
components_cfg_file: src/dataflex/configs/components.yaml
component_name: near
warmup_step: 400
update_step: 500
update_times: 2
```

**Notes:**

* `component_name: near` enables the NEAR component.
* `warmup_step / update_step / update_times` decide **when** and **how often** to re‑select the training subset; total steps ≈ `warmup_step + update_step × update_times`.
* Effective batch size = number of devices × `per_device_train_batch_size` × `gradient_accumulation_steps`.
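
Plugging in the values from the example config (the 8-GPU device count is an illustrative assumption; the guide does not fix it):

```python
# Schedule arithmetic for the example near.yaml above.
warmup_step, update_step, update_times = 400, 500, 2
total_steps = warmup_step + update_step * update_times  # approximate total optimizer steps

# Effective batch size; the device count is an assumption for illustration.
devices = 8
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
effective_batch = devices * per_device_train_batch_size * gradient_accumulation_steps

print(total_steps, effective_batch)  # 1400 256
```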

---

## 7. Run Training

```bash
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 dataflex-cli train examples/train_lora/selectors/near.yaml
```

**Note:** the above example runs with a distributed launch.

During training, NEAR is triggered at the scheduled steps: it reads the precomputed sample indices and selects the next training subset from them.

---

## 8. Merge & Export the Model

Same as the Less Selector pipeline.

**Config file:** `DataFlex/examples/merge_lora/llama3_lora_sft.yaml`

```yaml
model_name_or_path: base model path
adapter_name_or_path: finetuned adapter path
template: qwen
trust_remote_code: true

export_dir: ../dataflex_saves
export_size: 5
export_device: cpu
export_legacy_format: false
```

Run the export command (inside the LLaMA‑Factory directory):

```bash
llamafactory-cli export llama3_lora_sft.yaml
```

---

## 9. Evaluation & Comparison

We recommend using the [DataFlow](https://github.com/OpenDCAI/DataFlow) QA evaluation pipeline to compare **NEAR** against **Less** and **random sampling**.