Skip to content

fix(infra): move data service seed to worker-level config#1210

Merged
rchardx merged 2 commits intomainfrom
fw/fix-ds-seed
Apr 20, 2026
Merged

fix(infra): move data service seed to worker-level config#1210
rchardx merged 2 commits intomainfrom
fw/fix-ds-seed

Conversation

@garrett4wade
Copy link
Copy Markdown
Collaborator

Description

Move the random seed from a per-request parameter (set on each load_dataset call and re-set on epoch reset) to a worker-level configuration set once at startup. This prevents repeated seed re-initialization from interfering with data shuffling correctness when multiple datasets are loaded on the same worker. Also adds a datasets_lock to eliminate race conditions on concurrent dataset load/unload operations.

Related Issue

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Breaking Change Details (if applicable):

N/A

Additional Context

Key changes:

  • DataServiceConfig gains a seed field; from_dataset_config() now accepts seed from trainer config
  • Seed passed as --seed CLI arg to worker subprocess, set once in lifespan() via seeding.set_random_seed
  • Removed seed from WorkerLoadDatasetRequest, _DatasetState, and RDataset.register()
  • Added datasets_lock (asyncio.Lock) for atomic dataset load/unload — prevents TOCTOU race on the datasets dict
  • All trainers (PPOTrainer, RWTrainer, SFTTrainer) updated to pass seed through DataServiceConfig instead of per-dataset

Files changed:

  • areal/infra/data_service/controller/config.py — add seed field + from_dataset_config param
  • areal/infra/data_service/controller/controller.py — pass --seed arg, remove seed from load request
  • areal/infra/data_service/rdataset.py — remove seed param from register()
  • areal/infra/data_service/types.py — remove seed from WorkerLoadDatasetRequest
  • areal/infra/data_service/worker/__main__.py — add --seed CLI arg
  • areal/infra/data_service/worker/app.py — seed at startup, add datasets_lock
  • areal/infra/data_service/worker/config.py — add seed field
  • areal/trainer/rl_trainer.py — pass seed via DataServiceConfig
  • areal/trainer/rw_trainer.py — pass seed via DataServiceConfig
  • areal/trainer/sft_trainer.py — pass seed via DataServiceConfig

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the random seeding mechanism in the data service, moving from per-dataset seeds to a global worker-level seed set during initialization. It also introduces a datasets_lock in the worker application to synchronize dataset management. Feedback was provided regarding the concurrency of the worker, specifically suggesting that the datasets_lock should not be held during the long-running dataset loading process to avoid blocking other management operations.

Comment thread areal/infra/data_service/worker/app.py
garrett4wade and others added 2 commits April 20, 2026 16:05
Set random seed once at worker startup instead of per-request during
dataset load and epoch reset. This prevents seed re-initialization
from interfering with data shuffling across multiple datasets.

Key changes:
- Add seed field to DataServiceConfig and DataWorkerConfig
- Pass seed as CLI arg to worker process, set once in lifespan
- Remove seed from WorkerLoadDatasetRequest and _DatasetState
- Add datasets_lock for thread-safe dataset load/unload
- Update all trainers to pass seed via DataServiceConfig
Prevent races between load/unload and stateful endpoints
(fetch, reset, save, load) on the data service worker.

Key changes:
- Add _loading_ids reservation set so load_dataset does not
  hold datasets_lock across slow I/O (asyncio.to_thread)
- Add unloading flag to _DatasetState; unload_dataset drains
  in-flight state ops via state.lock before dict removal
- Introduce _locked_active_state context manager that checks
  the unloading flag; apply to fetch, reset, save, load
- Add 4 deterministic concurrency regression tests covering
  duplicate-load rejection, unload drain, stale-fetch 409,
  and cross-dataset non-blocking
Copy link
Copy Markdown
Collaborator

@rchardx rchardx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rchardx rchardx merged commit ad622ef into main Apr 20, 2026
6 checks passed
@rchardx rchardx deleted the fw/fix-ds-seed branch April 20, 2026 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants