Skip to content

Commit 6a71b0f

Browse files
authored
refactor(utils): reorganize domain-specific utils into natural homes (#925)
Move ~28 domain-specific files out of areal/utils/ into their natural domains, continuing the pattern set by recent infra refactors. Key changes: - Move FSDP utils (6 files) to areal/engine/fsdp_utils/ - Move Megatron/FP8 utils (13 files) to areal/engine/megatron_utils/ - Move infra utils (7 files) to areal/infra/utils/ - Move distributed.py, model.py to areal/engine/core/ - Fix determinisitc.py typo → deterministic.py - Refactor PYTHONPATH in launcher.py to marker-based repo root lookup - Update all imports (~75 files) and documentation
1 parent 4e5b30b commit 6a71b0f

76 files changed

Lines changed: 232 additions & 191 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.claude/agents/fsdp-engine-expert.md

Lines changed: 15 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -164,13 +164,13 @@ algorithm-specific subclasses
164164

165165
- `areal/api/alloc_mode.py` - `FSDPParallelStrategy` (inherits from `ParallelStrategy`)
166166
for FSDP-specific parallel dimensions
167-
- `areal/utils/fsdp/parallel.py` - `ParallelHelper` class for mesh construction and
168-
dimension validation
167+
- `areal/engine/fsdp_utils/parallel.py` - `ParallelHelper` class for mesh construction
168+
and dimension validation
169169

170170
**Model Parallelism Implementations**:
171171

172-
- **Tensor Parallelism**: `areal/utils/fsdp/parallel.py` - `apply_non_moe_tp()` and
173-
`parallelize_model()` for TP integration
172+
- **Tensor Parallelism**: `areal/engine/fsdp_utils/parallel.py` - `apply_non_moe_tp()`
173+
and `parallelize_model()` for TP integration
174174
- **Sequence Parallelism (Ulysses)**: `areal/models/fsdp/ulysses.py` - Ulysses SP
175175
communication primitives and input preparation
176176
- **Context Parallelism**: Integrated via Ulysses sequence parallel groups
@@ -183,9 +183,9 @@ algorithm-specific subclasses
183183

184184
**FSDP2 Wrapping and Sharding**:
185185

186-
- `areal/utils/fsdp/__init__.py` - `apply_fsdp2()` for FSDP2 module wrapping with mixed
187-
precision and offload policies
188-
- `areal/utils/fsdp/parallel.py` - `parallelize_model()` orchestrates TP + FSDP2
186+
- `areal/engine/fsdp_utils/__init__.py` - `apply_fsdp2()` for FSDP2 module wrapping with
187+
mixed precision and offload policies
188+
- `areal/engine/fsdp_utils/parallel.py` - `parallelize_model()` orchestrates TP + FSDP2
189189
application
190190

191191
**Model-Specific Components**:
@@ -198,15 +198,15 @@ algorithm-specific subclasses
198198

199199
**Utilities**:
200200

201-
- **Checkpointing**: `areal/utils/fsdp/checkpoint.py` - `DCPState` wrapper for
201+
- **Checkpointing**: `areal/engine/fsdp_utils/checkpoint.py` - `DCPState` wrapper for
202202
distributed checkpoint (DCP) integration
203-
- **Gradient Handling**: `areal/utils/fsdp/grad.py` - `fsdp2_clip_grad_norm()` with
204-
TP/DP/PP-aware gradient norm computation
205-
- **Optimizer**: `areal/utils/fsdp/optimizer.py` - `AnyPrecisionAdamW` for
203+
- **Gradient Handling**: `areal/engine/fsdp_utils/grad.py` - `fsdp2_clip_grad_norm()`
204+
with TP/DP/PP-aware gradient norm computation
205+
- **Optimizer**: `areal/engine/fsdp_utils/optimizer.py` - `AnyPrecisionAdamW` for
206206
mixed-precision training with Kahan summation
207-
- **Multi-Tensor Operations**: `areal/utils/fsdp/multi_tensor_apply.py` - Fallback
208-
implementations when Transformer Engine/Apex unavailable
209-
- **State Dict Loading**: `areal/utils/fsdp/__init__.py` -
207+
- **Multi-Tensor Operations**: `areal/engine/fsdp_utils/multi_tensor_apply.py` -
208+
Fallback implementations when Transformer Engine/Apex unavailable
209+
- **State Dict Loading**: `areal/engine/fsdp_utils/__init__.py` -
210210
`fsdp2_load_full_state_dict()` for broadcast loading from rank 0
211211

212212
**Algorithm-Specific Subclasses** (in `areal/engine/fsdp_engine.py`):
@@ -249,7 +249,7 @@ algorithm-specific subclasses
249249

250250
- **Main implementation**: `areal/engine/fsdp_engine.py`
251251
- **Configuration**: `areal/api/cli_args.py` (`TrainEngineConfig`)
252-
- **Utilities**: `areal/utils/fsdp/` directory for checkpointing, gradients,
252+
- **Utilities**: `areal/engine/fsdp_utils/` directory for checkpointing, gradients,
253253
optimization, and parallel helpers
254254
- **Model components**: `areal/models/fsdp/` for Ulysses sequence parallelism
255255
- **Examples**: YAML configuration files in `examples/` directory

.claude/agents/launcher-scheduler-expert.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,7 @@ Located in `areal/api/cli_args.py`:
6767
ClusterSpecConfig -> Launcher -> BASE_ENVIRONS + thread vars -> Worker processes
6868
```
6969

70-
Critical utilities in `areal/utils/launcher.py`:
70+
Critical utilities in `areal/infra/utils/launcher.py`:
7171

7272
- `BASE_ENVIRONS`: Essential runtime variables (PyTorch cache, Triton, tokenizers)
7373
- `get_thread_env_vars()`: CPU thread control based on allocated cores
@@ -102,8 +102,8 @@ Critical utilities in `areal/utils/launcher.py`:
102102
IP/hostname assumptions
103103
- Raise specific exceptions from `areal.infra.scheduler.exceptions` -> not generic
104104
exception types
105-
- Use `areal.utils.proc.kill_process_tree()` for process termination -> not leaving
106-
zombie processes
105+
- Use `areal.infra.utils.proc.kill_process_tree()` for process termination -> not
106+
leaving zombie processes
107107
- Propagate all `BASE_ENVIRONS` variables and thread control variables -> not missing
108108
environment variable propagation
109109
- Use `areal.utils.network.find_free_ports()` for port allocation -> not static port
@@ -143,7 +143,7 @@ Critical utilities in `areal/utils/launcher.py`:
143143
| `areal/infra/scheduler/local.py` | Local worker scheduling | GPU round-robin, port allocation, health monitoring |
144144
| `areal/infra/scheduler/slurm.py` | Slurm-integrated scheduling | Job array coordination, resource reservation |
145145
| `areal/infra/scheduler/ray.py` | Ray cluster scheduling | Ray placement groups, actor-based worker management |
146-
| `areal/utils/launcher.py` | Shared utilities | Environment variable management, configuration validation |
146+
| `areal/infra/utils/launcher.py` | Shared utilities | Environment variable management, configuration validation |
147147

148148
______________________________________________________________________
149149

@@ -175,7 +175,7 @@ Activation: Manual (when requested) for launcher/scheduler topics
175175
3. Document any new environment variables or configuration requirements
176176
177177
### When Utility Functions Change
178-
1. Update references to `areal/utils/launcher.py` functions
178+
1. Update references to `areal/infra/utils/launcher.py` functions
179179
2. Adjust "Environment Variable Propagation Pattern" if BASE_ENVIRONS changes
180180
3. Update diagnostic steps that rely on specific utility functions
181181

.claude/agents/megatron-engine-expert.md

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,7 @@ Key architectural principles:
4747
implementing distributed training coordination
4848
- **`ParallelStrategy`** (`areal/api/alloc_mode.py`): Configuration dataclass for
4949
parallel dimensions
50-
- **`MegatronCheckpointer`** (`areal/utils/megatron_checkpointer.py`): Checkpoint
50+
- **`MegatronCheckpointer`** (`areal/engine/megatron_utils/checkpointer.py`): Checkpoint
5151
handling for distributed state
5252

5353
### Key Methods
@@ -210,12 +210,13 @@ distributed training coordination
210210

211211
- `areal/api/alloc_mode.py` - `MegatronParallelStrategy` (inherits from
212212
`ParallelStrategy`) for Megatron-specific parallel dimensions
213-
- `areal/utils/megatron.py` - Core Megatron utilities and helper functions
213+
- `areal/engine/megatron_utils/megatron.py` - Core Megatron utilities and helper
214+
functions
214215

215216
**Checkpointing and State Management**:
216217

217-
- `areal/utils/megatron_checkpointer.py` - `MegatronCheckpointer` class for distributed
218-
checkpoint handling
218+
- `areal/engine/megatron_utils/checkpointer.py` - `MegatronCheckpointer` class for
219+
distributed checkpoint handling
219220
- Integrated with Megatron Core checkpoint system for pipeline-parallel models
220221

221222
**Model Parallelism Implementations**:
@@ -257,8 +258,8 @@ distributed training coordination
257258
- **Main implementation**: `areal/engine/megatron_engine.py`
258259
- **Configuration**: `areal/api/cli_args.py` (`TrainEngineConfig` with
259260
`MegatronEngineConfig`)
260-
- **Checkpointing**: `areal/utils/megatron_checkpointer.py`
261-
- **Utilities**: `areal/utils/megatron.py` for core Megatron utilities
261+
- **Checkpointing**: `areal/engine/megatron_utils/checkpointer.py`
262+
- **Utilities**: `areal/engine/megatron_utils/megatron.py` for core Megatron utilities
262263
- **Examples**: YAML configuration files in `examples/` and
263264
`areal/tests/sft/config_megatron.yaml`
264265

.claude/data/pr-review-change-types.md

Lines changed: 26 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -7,16 +7,16 @@ ______________________________________________________________________
77

88
## CRITICAL Level (Must use Opus)
99

10-
| Change Type | File Path Pattern | Code Pattern |
11-
| ---------------------- | ------------------------------------------------------------- | ----------------------------------------------------------- |
12-
| **ARCHON_CORE** | `areal/experimental/models/archon/` | - |
13-
| **ARCHON_PARALLEL** | `parallel_dims.py` | `ArchonParallelDims`, `_build_mesh`, `DeviceMesh` |
14-
| **ARCHON_MOE** | `archon/moe/` | `router`, `grouped_experts`, `TokenReorderer`, `grouped_mm` |
15-
| **ARCHON_PARALLELIZE** | `qwen*/infra/parallelize.py` | `apply_moe_ep_tp`, `apply_tp`, `apply_cp` |
16-
| **ARCHON_ENGINE** | `areal/experimental/engine/archon_engine.py` | `ArchonEngine` |
17-
| **FSDP_CORE** | `areal/utils/fsdp/`, `areal/engine/fsdp_engine.py` | `FSDP`, `FullyShardedDataParallel`, `fully_shard` |
18-
| **MEGATRON_CORE** | `areal/engine/megatron_engine.py`, `areal/utils/megatron*.py` | `MegatronEngine` |
19-
| **DCP_CHECKPOINT** | - | `DCP`, `DistributedCheckpoint`, `dcp.save`, `dcp.load` |
10+
| Change Type | File Path Pattern | Code Pattern |
11+
| ---------------------- | ----------------------------------------------------------------- | ----------------------------------------------------------- |
12+
| **ARCHON_CORE** | `areal/experimental/models/archon/` | - |
13+
| **ARCHON_PARALLEL** | `parallel_dims.py` | `ArchonParallelDims`, `_build_mesh`, `DeviceMesh` |
14+
| **ARCHON_MOE** | `archon/moe/` | `router`, `grouped_experts`, `TokenReorderer`, `grouped_mm` |
15+
| **ARCHON_PARALLELIZE** | `qwen*/infra/parallelize.py` | `apply_moe_ep_tp`, `apply_tp`, `apply_cp` |
16+
| **ARCHON_ENGINE** | `areal/experimental/engine/archon_engine.py` | `ArchonEngine` |
17+
| **FSDP_CORE** | `areal/engine/fsdp_utils/`, `areal/engine/fsdp_engine.py` | `FSDP`, `FullyShardedDataParallel`, `fully_shard` |
18+
| **MEGATRON_CORE** | `areal/engine/megatron_engine.py`, `areal/engine/megatron_utils/` | `MegatronEngine` |
19+
| **DCP_CHECKPOINT** | - | `DCP`, `DistributedCheckpoint`, `dcp.save`, `dcp.load` |
2020

2121
## HIGH Level (Recommend Opus)
2222

@@ -33,19 +33,19 @@ ______________________________________________________________________
3333

3434
## MEDIUM Level (Use Sonnet)
3535

36-
| Change Type | File Path Pattern | Code Pattern |
37-
| ----------------------- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
38-
| **TENSOR_OPS** | - | `.view(`, `.reshape(`, `dtype=`, `.detach()`, `no_grad`, `.contiguous()` |
39-
| **NUMERICAL** | - | `log(`, `softmax`, `cross_entropy`, `eps=`, `.clamp(`, `nan`, `inf` |
40-
| **WORKFLOW_ENGINE** | `areal/workflow/`, `areal/engine/` | `arun_episode`, `agenerate`, `RolloutWorkflow` |
41-
| **API_CONFIG** | `areal/api/` | `@dataclass`, `__post_init__`, `field(` |
42-
| **COMPILE** | - | `torch.compile`, `_dynamo`, `mark_dynamic`, `fullgraph` |
43-
| **ACTIVATION_CKPT** | `activation_checkpoint.py` | `activation_checkpoint`, `checkpoint_wrapper`, `selective_checkpoint` |
44-
| **CHECKPOINT_RECOVERY** | `areal/utils/saver.py`, `areal/utils/recover.py`, `areal/utils/fsdp/checkpoint.py` | `state_dict`, `load_state_dict`, `checkpoint` |
45-
| **REWARD** | `areal/reward/` | `reward_fn`, `AsyncRewardWrapper`, `MathVerifyWorker` |
46-
| **DATASET** | `areal/dataset/` | `get_*_dataset`, `DataLoader`, `IterableDataset` |
47-
| **LAUNCHER_SCHEDULER** | `areal/infra/launcher/`, `areal/infra/scheduler/`, `areal/infra/rpc/` | `LaunchConfig`, `Scheduler`, `RayLauncher`, `SlurmLauncher` |
48-
| **ATTENTION** | `attention/`, `attention/sdpa.py`, `attention/varlen.py` | `flash_attn`, `sdpa`, `varlen`, `causal_mask` |
36+
| Change Type | File Path Pattern | Code Pattern |
37+
| ----------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
38+
| **TENSOR_OPS** | - | `.view(`, `.reshape(`, `dtype=`, `.detach()`, `no_grad`, `.contiguous()` |
39+
| **NUMERICAL** | - | `log(`, `softmax`, `cross_entropy`, `eps=`, `.clamp(`, `nan`, `inf` |
40+
| **WORKFLOW_ENGINE** | `areal/workflow/`, `areal/engine/` | `arun_episode`, `agenerate`, `RolloutWorkflow` |
41+
| **API_CONFIG** | `areal/api/` | `@dataclass`, `__post_init__`, `field(` |
42+
| **COMPILE** | - | `torch.compile`, `_dynamo`, `mark_dynamic`, `fullgraph` |
43+
| **ACTIVATION_CKPT** | `activation_checkpoint.py` | `activation_checkpoint`, `checkpoint_wrapper`, `selective_checkpoint` |
44+
| **CHECKPOINT_RECOVERY** | `areal/utils/saver.py`, `areal/utils/recover.py`, `areal/engine/fsdp_utils/checkpoint.py` | `state_dict`, `load_state_dict`, `checkpoint` |
45+
| **REWARD** | `areal/reward/` | `reward_fn`, `AsyncRewardWrapper`, `MathVerifyWorker` |
46+
| **DATASET** | `areal/dataset/` | `get_*_dataset`, `DataLoader`, `IterableDataset` |
47+
| **LAUNCHER_SCHEDULER** | `areal/infra/launcher/`, `areal/infra/scheduler/`, `areal/infra/rpc/` | `LaunchConfig`, `Scheduler`, `RayLauncher`, `SlurmLauncher` |
48+
| **ATTENTION** | `attention/`, `attention/sdpa.py`, `attention/varlen.py` | `flash_attn`, `sdpa`, `varlen`, `causal_mask` |
4949

5050
## LOW Level (Use Haiku)
5151

@@ -131,14 +131,14 @@ ______________________________________________________________________
131131

132132
**FSDP Core**:
133133

134-
- `areal/utils/fsdp/`
134+
- `areal/engine/fsdp_utils/`
135135
- `areal/engine/fsdp_engine.py`
136136

137137
**Megatron Core**:
138138

139139
- `areal/engine/megatron_engine.py`
140-
- `areal/utils/megatron.py`
141-
- `areal/utils/megatron_checkpointer.py`
140+
- `areal/engine/megatron_utils/megatron.py`
141+
- `areal/engine/megatron_utils/checkpointer.py`
142142

143143
**Trainer Core**:
144144

.claude/hooks/check-expert-update.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@ check_expert_update() {
3232

3333
# FSDP Engine related
3434
if [[ "$file" == *"areal/engine/fsdp_engine"* ]] || \
35-
[[ "$file" == *"areal/utils/fsdp/"* ]]; then
35+
[[ "$file" == *"areal/engine/fsdp_utils/"* ]]; then
3636
reminder_file="fsdp-engine-expert.md"
3737
reminder_desc="FSDP"
3838
fi

.claude/rules/distributed.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
paths:
33
- areal/engine/**
44
- areal/experimental/**
5-
- areal/utils/fsdp/**
5+
- areal/engine/fsdp_utils/**
66
---
77

88
# Distributed Code Rules

.claude/skills/debug-distributed/SKILL.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -98,7 +98,7 @@ dist.barrier()
9898
**Timeout Adjustment** (for debugging only):
9999

100100
```python
101-
from areal.utils.distributed import patch_dist_group_timeout
101+
from areal.engine.core.distributed import patch_dist_group_timeout
102102
from datetime import timedelta
103103
patch_dist_group_timeout(timedelta(minutes=30))
104104
```

AGENTS.md

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -40,10 +40,17 @@ When unsure, leave a `TODO(agent)` comment and note the constraint in your respo
4040
- `areal/infra/` - Core infrastructure including single controller implementation,
4141
placement and allocation policies, async orchestration primitives, and
4242
hardware/platform abstractions for CPU/GPU/NPU runtimes.
43+
- `areal/infra/utils/` - Infrastructure utilities (launcher, process management,
44+
HTTP, concurrency, Slurm/Ray helpers).
4345
- `areal/dataset/` - Stateful dataset loaders (GSM8K, Geometry3K, CLEVR, HH-RLHF,
4446
TORL, etc.) and utilities that feed rollout jobs safely.
4547
- `areal/engine/` - Training backends (FSDP2, Megatron, PPO, SFT, reward modeling) and
4648
inference adapters (SGLang, vLLM remote engines).
49+
- `areal/engine/fsdp_utils/` - FSDP2-specific utilities (checkpoint, gradient,
50+
optimizer, parallel helpers).
51+
- `areal/engine/megatron_utils/` - Megatron/FP8 utilities (checkpoint, pipeline,
52+
quantization).
53+
- `areal/engine/core/` - Engine-shared utilities (distributed, model helpers).
4754
- `areal/experimental/` - Prototype engines/workflows that evolve quickly; expect
4855
breaking changes.
4956
- `areal/infra/launcher/` - Launch specs for local, Ray, and Slurm clusters, plus
@@ -56,8 +63,8 @@ When unsure, leave a `TODO(agent)` comment and note the constraint in your respo
5663
distributed backends).
5764
- `areal/tools/` - Developer utilities and maintenance scripts tied to the core
5865
package.
59-
- `areal/utils/` - Cross-cutting helpers for logging, tensor ops, stats tracking,
60-
checkpoints, and recovery.
66+
- `areal/utils/` - Cross-cutting helpers for logging, data processing, stats tracking,
67+
checkpoints, recovery, network, and RL functional ops.
6168
- `areal/workflow/` - Concrete rollout agents implementing `RolloutWorkflow`:
6269
multi-turn, RLVR, vision RLVR workflows, plus `openai_agent/` for OpenAI Agent-style
6370
implementations.

CLAUDE.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,10 +12,16 @@ learning.
1212
- `areal/` - Core package
1313
- `api/` - Config dataclasses, workflow/engine contracts
1414
- `engine/` - FSDP2, Megatron, SGLang/vLLM adapters
15+
- `fsdp_utils/` - FSDP2-specific utilities (checkpoint, grad, optimizer, parallel)
16+
- `megatron_utils/` - Megatron/FP8 utilities (checkpoint, pipeline, quantization)
17+
- `core/` - Engine-shared utilities (distributed, lock, model, offload)
18+
- `infra/` - Infrastructure (launcher, scheduler, RPC)
19+
- `utils/` - Infrastructure utilities (launcher, proc, http, concurrent, slurm, ray)
1520
- `workflow/` - RolloutWorkflow implementations
1621
- `reward/` - Reward functions
1722
- `dataset/` - Dataset loaders
18-
- `utils/` - Logging, tensor ops, checkpoints
23+
- `utils/` - Cross-cutting utilities (logging, data, checkpoints, network, RL
24+
functional)
1925
- `examples/` - Training scripts and configs
2026
- `docs/` - Jupyter Book source
2127

0 commit comments

Comments
 (0)