fix(infra): move data service seed to worker-level config by garrett4wade · Pull Request #1210 · inclusionAI/AReaL

garrett4wade · 2026-04-19T12:53:44Z

Description

Move the random seed from a per-request parameter (set on each load_dataset call and re-set on epoch reset) to a worker-level configuration set once at startup. This prevents repeated seed re-initialization from interfering with data shuffling correctness when multiple datasets are loaded on the same worker. Also adds a datasets_lock to eliminate race conditions on concurrent dataset load/unload operations.

Related Issue

Type of Change

Checklist

I have read the Contributing Guide
Pre-commit hooks pass (pre-commit run --all-files)
Relevant tests pass; new tests added for new functionality
Documentation updated (if applicable; built with ./docs/build_all.sh)
Branch is up to date with main
Self-reviewed via /review-pr command
This PR was created by a coding agent via /create-pr
This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Key changes:

DataServiceConfig gains a seed field; from_dataset_config() now accepts seed from trainer config
Seed passed as --seed CLI arg to worker subprocess, set once in lifespan() via seeding.set_random_seed
Removed seed from WorkerLoadDatasetRequest, _DatasetState, and RDataset.register()
Added datasets_lock (asyncio.Lock) for atomic dataset load/unload — prevents TOCTOU race on the datasets dict
All trainers (PPOTrainer, RWTrainer, SFTTrainer) updated to pass seed through DataServiceConfig instead of per-dataset

Files changed:

areal/infra/data_service/controller/config.py — add seed field + from_dataset_config param
areal/infra/data_service/controller/controller.py — pass --seed arg, remove seed from load request
areal/infra/data_service/rdataset.py — remove seed param from register()
areal/infra/data_service/types.py — remove seed from WorkerLoadDatasetRequest
areal/infra/data_service/worker/__main__.py — add --seed CLI arg
areal/infra/data_service/worker/app.py — seed at startup, add datasets_lock
areal/infra/data_service/worker/config.py — add seed field
areal/trainer/rl_trainer.py — pass seed via DataServiceConfig
areal/trainer/rw_trainer.py — pass seed via DataServiceConfig
areal/trainer/sft_trainer.py — pass seed via DataServiceConfig

gemini-code-assist

Code Review

This pull request refactors the random seeding mechanism in the data service, moving from per-dataset seeds to a global worker-level seed set during initialization. It also introduces a datasets_lock in the worker application to synchronize dataset management. Feedback was provided regarding the concurrency of the worker, specifically suggesting that the datasets_lock should not be held during the long-running dataset loading process to avoid blocking other management operations.

Set random seed once at worker startup instead of per-request during dataset load and epoch reset. This prevents seed re-initialization from interfering with data shuffling across multiple datasets. Key changes: - Add seed field to DataServiceConfig and DataWorkerConfig - Pass seed as CLI arg to worker process, set once in lifespan - Remove seed from WorkerLoadDatasetRequest and _DatasetState - Add datasets_lock for thread-safe dataset load/unload - Update all trainers to pass seed via DataServiceConfig

Prevent races between load/unload and stateful endpoints (fetch, reset, save, load) on the data service worker. Key changes: - Add _loading_ids reservation set so load_dataset does not hold datasets_lock across slow I/O (asyncio.to_thread) - Add unloading flag to _DatasetState; unload_dataset drains in-flight state ops via state.lock before dict removal - Introduce _locked_active_state context manager that checks the unloading flag; apply to fetch, reset, save, load - Add 4 deterministic concurrency regression tests covering duplicate-load rejection, unload drain, stale-fetch 409, and cross-dataset non-blocking

rchardx

LGTM

gemini-code-assist bot reviewed Apr 19, 2026

View reviewed changes

Comment thread areal/infra/data_service/worker/app.py

garrett4wade and others added 2 commits April 20, 2026 16:05

rchardx force-pushed the fw/fix-ds-seed branch from 21f9d9d to a2cf692 Compare April 20, 2026 08:06

rchardx requested review from nuzant and rchardx as code owners April 20, 2026 08:06

rchardx approved these changes Apr 20, 2026

View reviewed changes

rchardx merged commit ad622ef into main Apr 20, 2026
6 checks passed

rchardx deleted the fw/fix-ds-seed branch April 20, 2026 08:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(infra): move data service seed to worker-level config#1210

fix(infra): move data service seed to worker-level config#1210
rchardx merged 2 commits intomainfrom
fw/fix-ds-seed

garrett4wade commented Apr 19, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

rchardx left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

garrett4wade commented Apr 19, 2026

Description

Related Issue

Type of Change

Checklist

Additional Context

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

rchardx left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants