
feat(engine): support direct engine construction via from_pretrained without config dataclass#1140

Open
chenzhiyi021 wants to merge 11 commits into inclusionAI:main from chenzhiyi021:czy/engine-kwargs

Conversation

@chenzhiyi021

@chenzhiyi021 chenzhiyi021 commented Apr 7, 2026

  • Add a from_pretrained method to the FSDPEngine class
  • Integrate the engine created by from_pretrained into test_train_engine.py
  • All tests in test_train_engine.py pass

Description

Decouple FSDPEngine from the config system by adding a from_pretrained factory method.

Related Issue

This PR addresses the task discussed in this comment under Issue #907.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a from_pretrained class method to the FSDPEngine class, allowing for direct instantiation from a model path and parameters without the need to manually construct a TrainEngineConfig. The test suite has been updated to include coverage for this new construction method. Review feedback identifies a potential issue where mandatory fields in TrainEngineConfig, such as experiment_name and trial_name, are missing, which could lead to runtime errors. Additionally, it is recommended to use local model paths in unit tests instead of hardcoded HuggingFace IDs to improve test reliability and execution speed.

Comment on lines +244 to +253

```python
config = TrainEngineConfig(
    path=model,
    backend=f"fsdp:d{dp_size}t{tp_size}",
    dtype=dtype,
    optimizer=optimizer_config,
    use_lora=use_lora,
    lora_rank=lora_rank,
    lora_alpha=lora_alpha,
    **kwargs,
)
```
Contributor


Severity: high

The TrainEngineConfig dataclass requires experiment_name and trial_name as mandatory fields (marked as MISSING in areal/api/cli_args.py). If these are not provided in kwargs, the resulting config object will contain MISSING sentinels instead of strings. This will cause runtime errors in methods like _update_weights_from_disk which use these fields for path construction.

I suggest providing sensible default values if they are not explicitly passed via kwargs.

Suggested change

```diff
 config = TrainEngineConfig(
+    experiment_name=kwargs.pop("experiment_name", "from_pretrained"),
+    trial_name=kwargs.pop("trial_name", "default"),
     path=model,
     backend=f"fsdp:d{dp_size}t{tp_size}",
     dtype=dtype,
     optimizer=optimizer_config,
     use_lora=use_lora,
     lora_rank=lora_rank,
     lora_alpha=lora_alpha,
     **kwargs,
 )
```

Comment on lines +391 to +396

```python
engine = fsdp_module.FSDPEngine.from_pretrained(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    dp_size=1,
    learning_rate=1e-5,
    use_lora=True,
)
```
Contributor


Severity: medium

Using a hardcoded HuggingFace model ID like "Qwen/Qwen2.5-1.5B-Instruct" in unit tests is problematic as it requires internet access and can be slow. It is better to use the MODEL_PATH variable already defined in the test file, which points to a local mock model or a cached version suitable for testing.

Suggested change

```diff
 engine = fsdp_module.FSDPEngine.from_pretrained(
-    model="Qwen/Qwen2.5-1.5B-Instruct",
+    model=MODEL_PATH,
     dp_size=1,
     learning_rate=1e-5,
     use_lora=True,
 )
```

Comment on lines +397 to +402

```python
config = TrainEngineConfig(
    path="Qwen/Qwen2.5-1.5B-Instruct",
    backend="fsdp:d1t1",
    optimizer=OptimizerConfig(lr=1e-5),
    use_lora=True,
)
```
Contributor


Severity: medium

For consistency with the suggested change above and to avoid network dependencies, please use MODEL_PATH here as well.

Suggested change

```diff
 config = TrainEngineConfig(
-    path="Qwen/Qwen2.5-1.5B-Instruct",
+    path=MODEL_PATH,
     backend="fsdp:d1t1",
     optimizer=OptimizerConfig(lr=1e-5),
     use_lora=True,
 )
```

```python
        self.enable_tree_training: bool = self.config.enable_tree_training

    @classmethod
    def from_pretrained(
```
Collaborator


I think there’s a correctness gap in FSDPEngine.from_pretrained().

The factory records dp_size / tp_size in config.backend, but on the direct-construction path create_process_group() does not seem to read self.config.backend when parallel_strategy is omitted. It just falls back to ParallelStrategy(), which means the engine still initializes with the default 1x1 topology.

So FSDPEngine.from_pretrained(..., dp_size=2) looks supported by the API, but in practice it appears to run as d1/t1. That’s different from the normal controller/config path, where config.backend is parsed before engine setup.

Could we either derive the parallel strategy from config.backend by default here, or otherwise make it explicit that non-default dp_size / tp_size are not actually honored on this path? I think a regression test for from_pretrained(dp_size>1 or tp_size>1) would also help.
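One way to close this gap is to derive the topology from the backend string the factory already writes. A minimal sketch, assuming the `fsdp:d{dp}t{tp}` format used in this PR; the helper name `parse_fsdp_backend` is hypothetical, not part of the codebase:

```python
import re

def parse_fsdp_backend(backend: str) -> tuple[int, int]:
    """Extract (dp_size, tp_size) from a backend spec like 'fsdp:d2t1'."""
    m = re.fullmatch(r"fsdp:d(\d+)t(\d+)", backend)
    if m is None:
        raise ValueError(f"unrecognized FSDP backend spec: {backend!r}")
    return int(m.group(1)), int(m.group(2))

# create_process_group() could then default to the topology recorded in
# config.backend instead of falling back to ParallelStrategy():
# dp_size, tp_size = parse_fsdp_backend(config.backend)
```

With this, `from_pretrained(..., dp_size=2)` and the controller/config path would agree on the effective topology.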

```python
        self._engine = RemoteInfEngine(config, SGLangBackend())

    @classmethod
    def from_pretrained(
```
Collaborator


I think RemoteSGLangEngine.from_pretrained() is a bit narrower than the existing config-based API in a way that may surprise callers.

Right now the factory always injects tokenizer_path=model and also forwards **kwargs. So if someone wants to use a different tokenizer path — which InferenceEngineConfig does support — passing tokenizer_path=... through kwargs would raise a duplicate-keyword error instead of behaving like the normal config construction path.

So the direct-construction API is not fully equivalent to explicit InferenceEngineConfig construction here.

Maybe we should add an explicit tokenizer_path: str | None = None argument and use tokenizer_path or model, or only set tokenizer_path when the caller did not already provide it.

Collaborator


Please tell me what you think about it.

Comment on lines +252 to +261

```python
config = TrainEngineConfig(
    path=model,
    backend=f"fsdp:d{dp_size}t{tp_size}",
    dtype=dtype,
    optimizer=optimizer_config,
    use_lora=use_lora,
    lora_rank=lora_rank,
    lora_alpha=lora_alpha,
    **kwargs,
)
```
Collaborator


I think there’s another lifecycle gap in FSDPEngine.from_pretrained() around experiment_name and trial_name.

The new factory in areal/engine/fsdp_engine.py builds a TrainEngineConfig without setting those fields, but later the disk weight-update path in the same file (_update_weights_from_disk()) still uses self.config.experiment_name and self.config.trial_name to publish the update rendezvous key.

On the other side, areal/infra/remote_inf_engine.py also expects config.experiment_name and config.trial_name to be present for disk-based weight updates, and raises if they are missing.

So this looks like a cross-file mismatch: the new convenience constructor can create an engine that passes the new tests, but may still break once it is used in a rollout-connected or disk-update workflow.

Would it make sense to require experiment_name and trial_name in the train-engine from_pretrained() path, or at least fail early before disk-update flows if they are unset? A parity test for from_pretrained() plus a disk update / connected rollout path would probably catch this.
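The fail-early option could look like the sketch below. This is an illustration only; `check_rendezvous_fields` is a hypothetical helper, and the check also catches the dataclass `MISSING` sentinel because that sentinel is not a `str`:

```python
def check_rendezvous_fields(config) -> None:
    """Fail fast before disk-based weight updates if the fields used to
    build the update rendezvous key were never set on the config."""
    for field in ("experiment_name", "trial_name"):
        value = getattr(config, field, None)
        if not isinstance(value, str) or not value:
            raise ValueError(
                f"{field} must be a non-empty string before disk weight "
                "updates; pass it explicitly to from_pretrained()."
            )
```

Calling this at the top of `_update_weights_from_disk()` would turn a confusing path-construction failure into an actionable error message.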

@chenzhiyi021
Author

Thank you for your review.

I changed RemoteSGLangEngine's factory classmethod to:

```python
@classmethod
def from_pretrained(
    cls,
    tokenizer_path: str | None = None,
    dp_size: int = 1,
    max_concurrent_rollouts: int | None = None,
    **kwargs,
) -> "RemoteInfEngine":
```

I resolved the backend parsing problem for both RemoteSGLangEngine and FSDPEngine, and added experiment_name and trial_name parameters to FSDPEngine's factory classmethod.
