Commit b658ff3 (1 parent: 6860e70)

fix(api): replace Literal type with str for SchedulingSpec.ray_placement_strategy

The Literal type annotation breaks omegaconf config loading: omegaconf 2.4.0.dev2 (and later dev versions) don't support Literal in structured configs. This caused a ValidationError on any config-loading path that touches SchedulingSpec, including `scheduler.type=local`, which doesn't use Ray.

Changes:
- Change the type from `Literal["shared", "separate", "deferred"]` to `str`
- Add `__post_init__` validation to ensure `ray_placement_strategy` is valid
- Remove the now-unused `Literal` import

Fixes #975

3 files changed: 103 additions & 91 deletions
areal/api/cli_args.py (11 additions & 2 deletions)

```diff
@@ -5,7 +5,7 @@
 from dataclasses import asdict, dataclass, field, fields
 from enum import Enum
 from pathlib import Path
-from typing import TYPE_CHECKING, Any, ClassVar, Literal
+from typing import TYPE_CHECKING, Any, ClassVar
 
 import uvloop
 import yaml
@@ -854,7 +854,7 @@ class SchedulingSpec:
     exclude: str | None = field(
         default=None, metadata={"help": "sbatch/srun's `--exclude` option for slurm."}
     )
-    ray_placement_strategy: Literal["shared", "separate", "deferred"] = field(
+    ray_placement_strategy: str = field(
         default="shared",
         metadata={
             "help": "Which placement strategy to use for Ray scheduling. "
@@ -865,6 +865,15 @@ class SchedulingSpec:
         },
     )
 
+    def __post_init__(self):
+        """Validate scheduling spec configuration."""
+        valid_strategies = {"shared", "separate", "deferred"}
+        if self.ray_placement_strategy not in valid_strategies:
+            raise ValueError(
+                f"ray_placement_strategy must be one of {valid_strategies}, "
+                f"got '{self.ray_placement_strategy}'"
+            )
+
 
 @dataclass
 class TrainEngineConfig:
```
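The pattern behind this change, replacing a static `Literal` constraint with a plain `str` field plus a runtime check in `__post_init__`, can be sketched in isolation. This is a minimal stand-in for the real `SchedulingSpec` (only the affected field is reproduced here):

```python
from dataclasses import dataclass

VALID_STRATEGIES = {"shared", "separate", "deferred"}


@dataclass
class SchedulingSpecSketch:
    # A plain `str` annotation keeps the field compatible with omegaconf
    # structured configs, which reject Literal[...] in recent dev releases.
    ray_placement_strategy: str = "shared"

    def __post_init__(self):
        # Runtime validation replaces the static Literal constraint, so an
        # invalid value still fails loudly, just at construction time.
        if self.ray_placement_strategy not in VALID_STRATEGIES:
            raise ValueError(
                f"ray_placement_strategy must be one of {VALID_STRATEGIES}, "
                f"got '{self.ray_placement_strategy}'"
            )


# Valid values (including the default) construct normally.
spec = SchedulingSpecSketch()
assert spec.ray_placement_strategy == "shared"
assert SchedulingSpecSketch("deferred").ray_placement_strategy == "deferred"

# An invalid value raises immediately instead of slipping through as a bare str.
try:
    SchedulingSpecSketch("bogus")
except ValueError as exc:
    print(exc)
```

The trade-off is that the mistake is no longer caught by static type checkers, but the error surfaces at dataclass construction, which is early enough for config loading.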

docs/en/cli_reference.md (17 additions & 16 deletions)

The hunk at `@@ -945,22 +945,23 @@` re-aligns the `SchedulingSpec` parameter table and adds a row for the new option. The resulting table:

Configuration class: SchedulingSpec

| Parameter | Type | Default | Description |
| ------------------------ | ---------------------- | -------------------------------------------- | ----------- |
| `cpu` | integer | `8` | Number of CPU cores required per GPU |
| `gpu` | integer | `0` | Number of GPU units required. Used only when allocating pods. |
| `mem` | integer | `32` | Amount of memory (GB) required per GPU |
| `port_count` | integer | `2` | Number of ports to expose |
| `image` | string | `"/storage/openpsi/images/areal-latest.sif"` | Docker/Singularity container image to use. Currently only used by Slurm. Will potentially be used by Kubernetes in the future. |
| `task_type` | string | `"worker"` | Task type (e.g., worker, engine) **Choices:** `worker`, `engine` |
| `env_vars` | `dict` | **Required** | Environment variables for the container |
| `cmd` | string \| None | `None` | Command to execute inside the container. Defaults to AReaL's RPC server. |
| `srun_additional_args` | string | `"--unbuffered --mpi=pmi2 -K --chdir $PWD"` | Additional arguments to pass to the srun command. Only used by slurm. |
| `additional_bash_cmds` | list of string \| None | `None` | Additional bash commands to set up the container before running the torchrun command. Only used by slurm. |
| `container_type` | string | `"apptainer"` | Type of containers used in slurm **Choices:** `apptainer`, `none` |
| `mount` | string | `"/storage:/storage"` | Mount path for slurm. |
| `nodelist` | string \| None | `None` | sbatch/srun's `--nodelist` option for slurm. |
| `exclude` | string \| None | `None` | sbatch/srun's `--exclude` option for slurm. |
| `ray_placement_strategy` | string | `"shared"` | Which placement strategy to use for Ray scheduling. Shared produces one placement group for all workers in the role (training). Separate produces one placement group per worker (rollout). Deferred does the same as separate but defers accelerator scheduling (multinode rollout). **Choices:** `shared`, `separate`, `deferred` |

(section-scheduling-strategy)=
