feat(engine): support direct engine construction via from_pretrained without config dataclass #1140

chenzhiyi021 wants to merge 11 commits into inclusionAI:main
Conversation
- Add from_pretrained method in class FSDPEngine
- Integrate the engine created by the from_pretrained method into test_train_engine.py
- test_train_engine.py passes
Code Review
This pull request introduces a from_pretrained class method to the FSDPEngine class, allowing for direct instantiation from a model path and parameters without the need to manually construct a TrainEngineConfig. The test suite has been updated to include coverage for this new construction method. Review feedback identifies a potential issue where mandatory fields in TrainEngineConfig, such as experiment_name and trial_name, are missing, which could lead to runtime errors. Additionally, it is recommended to use local model paths in unit tests instead of hardcoded HuggingFace IDs to improve test reliability and execution speed.
```python
config = TrainEngineConfig(
    path=model,
    backend=f"fsdp:d{dp_size}t{tp_size}",
    dtype=dtype,
    optimizer=optimizer_config,
    use_lora=use_lora,
    lora_rank=lora_rank,
    lora_alpha=lora_alpha,
    **kwargs,
)
```
The TrainEngineConfig dataclass requires experiment_name and trial_name as mandatory fields (marked as MISSING in areal/api/cli_args.py). If these are not provided in kwargs, the resulting config object will contain MISSING sentinels instead of strings. This will cause runtime errors in methods like _update_weights_from_disk which use these fields for path construction.
I suggest providing sensible default values if they are not explicitly passed via kwargs.
Suggested change:

```diff
 config = TrainEngineConfig(
+    experiment_name=kwargs.pop("experiment_name", "from_pretrained"),
+    trial_name=kwargs.pop("trial_name", "default"),
     path=model,
     backend=f"fsdp:d{dp_size}t{tp_size}",
     dtype=dtype,
     optimizer=optimizer_config,
     use_lora=use_lora,
     lora_rank=lora_rank,
     lora_alpha=lora_alpha,
     **kwargs,
 )
```
```python
engine = fsdp_module.FSDPEngine.from_pretrained(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    dp_size=1,
    learning_rate=1e-5,
    use_lora=True,
)
```
Using a hardcoded HuggingFace model ID like "Qwen/Qwen2.5-1.5B-Instruct" in unit tests is problematic as it requires internet access and can be slow. It is better to use the MODEL_PATH variable already defined in the test file, which points to a local mock model or a cached version suitable for testing.
Suggested change:

```diff
 engine = fsdp_module.FSDPEngine.from_pretrained(
-    model="Qwen/Qwen2.5-1.5B-Instruct",
+    model=MODEL_PATH,
     dp_size=1,
     learning_rate=1e-5,
     use_lora=True,
 )
```
```python
config = TrainEngineConfig(
    path="Qwen/Qwen2.5-1.5B-Instruct",
    backend="fsdp:d1t1",
    optimizer=OptimizerConfig(lr=1e-5),
    use_lora=True,
)
```
For consistency with the suggested change above and to avoid network dependencies, please use MODEL_PATH here as well.
Suggested change:

```diff
 config = TrainEngineConfig(
-    path="Qwen/Qwen2.5-1.5B-Instruct",
+    path=MODEL_PATH,
     backend="fsdp:d1t1",
     optimizer=OptimizerConfig(lr=1e-5),
     use_lora=True,
 )
```
Force-pushed from f4c1da3 to 13f45e1
```python
        self.enable_tree_training: bool = self.config.enable_tree_training

    @classmethod
    def from_pretrained(
```
I think there’s a correctness gap in FSDPEngine.from_pretrained().
The factory records dp_size / tp_size in config.backend, but on the direct-construction path create_process_group() does not seem to read self.config.backend when parallel_strategy is omitted. It just falls back to ParallelStrategy(), which means the engine still initializes with the default 1x1 topology.
So FSDPEngine.from_pretrained(..., dp_size=2) looks supported by the API, but in practice it appears to run as d1/t1. That’s different from the normal controller/config path, where config.backend is parsed before engine setup.
Could we either derive the parallel strategy from config.backend by default here, or otherwise make it explicit that non-default dp_size / tp_size are not actually honored on this path? I think a regression test for from_pretrained(dp_size>1 or tp_size>1) would also help.
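One way to close this gap is to derive the topology from the backend string the factory itself writes. The sketch below assumes the backend spec always has the `fsdp:d{dp}t{tp}` shape shown in the diff; `parse_backend_parallelism` and `ParallelSizes` are illustrative names, not existing AReaL APIs:

```python
import re
from dataclasses import dataclass


@dataclass
class ParallelSizes:
    # Defaults mirror the 1x1 topology the engine currently falls back to.
    dp_size: int = 1
    tp_size: int = 1


def parse_backend_parallelism(backend: str) -> ParallelSizes:
    """Extract dp/tp sizes from a backend spec like 'fsdp:d2t4'.

    Falls back to the default 1x1 topology when the spec does not
    match, matching the current ParallelStrategy() behavior.
    """
    match = re.fullmatch(r"fsdp:d(\d+)t(\d+)", backend)
    if match is None:
        return ParallelSizes()
    return ParallelSizes(dp_size=int(match.group(1)),
                         tp_size=int(match.group(2)))
```

With something like this, create_process_group() could default its parallel strategy to the parsed sizes instead of silently ignoring dp_size/tp_size on the direct-construction path.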
```python
        self._engine = RemoteInfEngine(config, SGLangBackend())

    @classmethod
    def from_pretrained(
```
I think RemoteSGLangEngine.from_pretrained() is a bit narrower than the existing config-based API in a way that may surprise callers.
Right now the factory always injects tokenizer_path=model and also forwards **kwargs. So if someone wants to use a different tokenizer path — which InferenceEngineConfig does support — passing tokenizer_path=... through kwargs would raise a duplicate-keyword error instead of behaving like the normal config construction path.
So the direct-construction API is not fully equivalent to explicit InferenceEngineConfig construction here.
Maybe we should add an explicit tokenizer_path: str | None = None argument and use tokenizer_path or model, or only set tokenizer_path when the caller did not already provide it.
Please tell me what you think about it.
```python
config = TrainEngineConfig(
    path=model,
    backend=f"fsdp:d{dp_size}t{tp_size}",
    dtype=dtype,
    optimizer=optimizer_config,
    use_lora=use_lora,
    lora_rank=lora_rank,
    lora_alpha=lora_alpha,
    **kwargs,
)
```
There was a problem hiding this comment.
I think there’s another lifecycle gap in FSDPEngine.from_pretrained() around experiment_name and trial_name.
The new factory in areal/engine/fsdp_engine.py builds a TrainEngineConfig without setting those fields, but later the disk weight-update path in the same file (_update_weights_from_disk()) still uses self.config.experiment_name and self.config.trial_name to publish the update rendezvous key.
On the other side, areal/infra/remote_inf_engine.py also expects config.experiment_name and config.trial_name to be present for disk-based weight updates, and raises if they are missing.
So this looks like a cross-file mismatch: the new convenience constructor can create an engine that passes the new tests, but may still break once it is used in a rollout-connected or disk-update workflow.
Would it make sense to require experiment_name and trial_name in the train-engine from_pretrained() path, or at least fail early before disk-update flows if they are unset? A parity test for from_pretrained() plus a disk update / connected rollout path would probably catch this.
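The fail-early option could look roughly like the helper below, called at the end of from_pretrained(). The field names come from the PR discussion; the helper itself and its exact checks are illustrative:

```python
def validate_rendezvous_fields(config) -> None:
    """Raise at construction time if the fields used to build the
    disk-update rendezvous key are unset, instead of failing later
    inside _update_weights_from_disk().

    Treats None, non-strings (e.g. a MISSING sentinel object), empty
    strings, and OmegaConf's '???' marker as unset.
    """
    for field in ("experiment_name", "trial_name"):
        value = getattr(config, field, None)
        if not isinstance(value, str) or not value or value == "???":
            raise ValueError(
                f"config.{field} must be a non-empty string before "
                "disk-based weight updates; pass it to from_pretrained()."
            )
```

A parity test would then construct an engine via from_pretrained() and assert that either both fields are populated or this validation fires before any rollout-connected flow.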
…assmethod of FSDPEngine.
…ne and fix data_parallel_size parse.
Thank you for your review. I changed RemoteSGLangEngine's factory classmethod as suggested, resolved the backend parse problem for both RemoteSGLangEngine and FSDPEngine, and added experiment_name and trial_name parameters to FSDPEngine's factory classmethod.
Description
Decouple FSDPEngine from the config system by adding a from_pretrained factory method.
Related Issue
This PR addresses the task discussed in this comment under Issue #907.
Type of Change
Checklist
- Pre-commit checks pass (pre-commit run --all-files)
- Documentation builds (./docs/build_all.sh)
- Branch targets main; /review-pr command / /create-pr

Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in GitHub Discussions!