feat(engine): support direct engine construction via from_pretrained without config dataclass #1140

chenzhiyi021 wants to merge 11 commits into inclusionAI:main
Conversation
- Add from_pretrained method in class FSDPEngine
- Integrate the engine created by the from_pretrained method into test_train_engine.py
- test_train_engine.py passes
Code Review
This pull request introduces a from_pretrained class method to the FSDPEngine class, allowing for direct instantiation from a model path and parameters without the need to manually construct a TrainEngineConfig. The test suite has been updated to include coverage for this new construction method. Review feedback identifies a potential issue where mandatory fields in TrainEngineConfig, such as experiment_name and trial_name, are missing, which could lead to runtime errors. Additionally, it is recommended to use local model paths in unit tests instead of hardcoded HuggingFace IDs to improve test reliability and execution speed.
```python
config = TrainEngineConfig(
    path=model,
    backend=f"fsdp:d{dp_size}t{tp_size}",
    dtype=dtype,
    optimizer=optimizer_config,
    use_lora=use_lora,
    lora_rank=lora_rank,
    lora_alpha=lora_alpha,
    **kwargs,
)
```
The TrainEngineConfig dataclass requires experiment_name and trial_name as mandatory fields (marked as MISSING in areal/api/cli_args.py). If these are not provided in kwargs, the resulting config object will contain MISSING sentinels instead of strings. This will cause runtime errors in methods like _update_weights_from_disk which use these fields for path construction.
I suggest providing sensible default values if they are not explicitly passed via kwargs.
Suggested change:

```diff
 config = TrainEngineConfig(
+    experiment_name=kwargs.pop("experiment_name", "from_pretrained"),
+    trial_name=kwargs.pop("trial_name", "default"),
     path=model,
     backend=f"fsdp:d{dp_size}t{tp_size}",
     dtype=dtype,
     optimizer=optimizer_config,
     use_lora=use_lora,
     lora_rank=lora_rank,
     lora_alpha=lora_alpha,
     **kwargs,
 )
```
```python
engine = fsdp_module.FSDPEngine.from_pretrained(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    dp_size=1,
    learning_rate=1e-5,
    use_lora=True,
)
```
Using a hardcoded HuggingFace model ID like "Qwen/Qwen2.5-1.5B-Instruct" in unit tests is problematic as it requires internet access and can be slow. It is better to use the MODEL_PATH variable already defined in the test file, which points to a local mock model or a cached version suitable for testing.
Suggested change:

```diff
 engine = fsdp_module.FSDPEngine.from_pretrained(
-    model="Qwen/Qwen2.5-1.5B-Instruct",
+    model=MODEL_PATH,
     dp_size=1,
     learning_rate=1e-5,
     use_lora=True,
 )
```
```python
config = TrainEngineConfig(
    path="Qwen/Qwen2.5-1.5B-Instruct",
    backend="fsdp:d1t1",
    optimizer=OptimizerConfig(lr=1e-5),
    use_lora=True,
)
```
For consistency with the suggested change above and to avoid network dependencies, please use MODEL_PATH here as well.
Suggested change:

```diff
 config = TrainEngineConfig(
-    path="Qwen/Qwen2.5-1.5B-Instruct",
+    path=MODEL_PATH,
     backend="fsdp:d1t1",
     optimizer=OptimizerConfig(lr=1e-5),
     use_lora=True,
 )
```
Force-pushed from f4c1da3 to 13f45e1
```python
        self.enable_tree_training: bool = self.config.enable_tree_training

    @classmethod
    def from_pretrained(
```
I think there’s a correctness gap in FSDPEngine.from_pretrained().
The factory records dp_size / tp_size in config.backend, but on the direct-construction path create_process_group() does not seem to read self.config.backend when parallel_strategy is omitted. It just falls back to ParallelStrategy(), which means the engine still initializes with the default 1x1 topology.
So FSDPEngine.from_pretrained(..., dp_size=2) looks supported by the API, but in practice it appears to run as d1/t1. That’s different from the normal controller/config path, where config.backend is parsed before engine setup.
Could we either derive the parallel strategy from config.backend by default here, or otherwise make it explicit that non-default dp_size / tp_size are not actually honored on this path? I think a regression test for from_pretrained(dp_size>1 or tp_size>1) would also help.
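One way to close this gap is to derive the topology from the backend string the factory itself writes. The sketch below assumes the backend spec always has the `fsdp:d{dp}t{tp}` shape shown in the diff; `parse_backend_parallelism` and `ParallelSizes` are illustrative names, not existing AReaL APIs:

```python
import re
from dataclasses import dataclass


@dataclass
class ParallelSizes:
    # Defaults mirror the 1x1 topology the engine currently falls back to.
    dp_size: int = 1
    tp_size: int = 1


def parse_backend_parallelism(backend: str) -> ParallelSizes:
    """Extract dp/tp sizes from a backend spec like 'fsdp:d2t4'.

    Falls back to the default 1x1 topology when the spec does not
    match, matching the current ParallelStrategy() behavior.
    """
    match = re.fullmatch(r"fsdp:d(\d+)t(\d+)", backend)
    if match is None:
        return ParallelSizes()
    return ParallelSizes(dp_size=int(match.group(1)),
                         tp_size=int(match.group(2)))
```

With something like this, create_process_group() could default its parallel strategy to the parsed sizes instead of silently ignoring dp_size/tp_size on the direct-construction path.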
```python
        self._engine = RemoteInfEngine(config, SGLangBackend())

    @classmethod
    def from_pretrained(
```
I think RemoteSGLangEngine.from_pretrained() is a bit narrower than the existing config-based API in a way that may surprise callers.
Right now the factory always injects tokenizer_path=model and also forwards **kwargs. So if someone wants to use a different tokenizer path — which InferenceEngineConfig does support — passing tokenizer_path=... through kwargs would raise a duplicate-keyword error instead of behaving like the normal config construction path.
So the direct-construction API is not fully equivalent to explicit InferenceEngineConfig construction here.
Maybe we should add an explicit tokenizer_path: str | None = None argument and use tokenizer_path or model, or only set tokenizer_path when the caller did not already provide it.
Please tell me what you think about it.
```python
config = TrainEngineConfig(
    path=model,
    backend=f"fsdp:d{dp_size}t{tp_size}",
    dtype=dtype,
    optimizer=optimizer_config,
    use_lora=use_lora,
    lora_rank=lora_rank,
    lora_alpha=lora_alpha,
    **kwargs,
)
```
There was a problem hiding this comment.
I think there’s another lifecycle gap in FSDPEngine.from_pretrained() around experiment_name and trial_name.
The new factory in areal/engine/fsdp_engine.py builds a TrainEngineConfig without setting those fields, but later the disk weight-update path in the same file (_update_weights_from_disk()) still uses self.config.experiment_name and self.config.trial_name to publish the update rendezvous key.
On the other side, areal/infra/remote_inf_engine.py also expects config.experiment_name and config.trial_name to be present for disk-based weight updates, and raises if they are missing.
So this looks like a cross-file mismatch: the new convenience constructor can create an engine that passes the new tests, but may still break once it is used in a rollout-connected or disk-update workflow.
Would it make sense to require experiment_name and trial_name in the train-engine from_pretrained() path, or at least fail early before disk-update flows if they are unset? A parity test for from_pretrained() plus a disk update / connected rollout path would probably catch this.
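The fail-early option could look roughly like the helper below, called at the end of from_pretrained(). The field names come from the PR discussion; the helper itself and its exact checks are illustrative:

```python
def validate_rendezvous_fields(config) -> None:
    """Raise at construction time if the fields used to build the
    disk-update rendezvous key are unset, instead of failing later
    inside _update_weights_from_disk().

    Treats None, non-strings (e.g. a MISSING sentinel object), empty
    strings, and OmegaConf's '???' marker as unset.
    """
    for field in ("experiment_name", "trial_name"):
        value = getattr(config, field, None)
        if not isinstance(value, str) or not value or value == "???":
            raise ValueError(
                f"config.{field} must be a non-empty string before "
                "disk-based weight updates; pass it to from_pretrained()."
            )
```

A parity test would then construct an engine via from_pretrained() and assert that either both fields are populated or this validation fires before any rollout-connected flow.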
…assmethod of FSDPEngine.
…ne and fix data_parallel_size parse.
Thank you for your review. I changed RemoteSGLangEngine's factory classmethod as suggested, resolved the backend parse problem for both RemoteSGLangEngine and FSDPEngine, and added experiment_name and trial_name parameters to FSDPEngine's factory classmethod.
Description
Decouple FSDPEngine from the config system by adding a from_pretrained factory method.
Related Issue
This PR addresses the task discussed in this comment under Issue #907.
Type of Change
Checklist
- Pre-commit checks pass (pre-commit run --all-files)
- Documentation builds (./docs/build_all.sh)
- Branch targets main; /review-pr command / /create-pr

Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in GitHub Discussions!