fix: evict orphan-revision routes through RouteEvictionHandler#11370
fix: evict orphan-revision routes through RouteEvictionHandler#11370HyeockJinKim merged 2 commits intomainfrom
Conversation
The route eviction handler used to fire only on RUNNING routes whose health status the scaling group flagged for cleanup (typically UNHEALTHY). Active routes whose ``revision_id`` matches neither the endpoint's ``current_revision`` nor its ``deploying_revision`` were left alone, so a preempted rollout would leave behind PROVISIONING / RUNNING routes pointing at a stale revision until something else happened to them. Broaden the handler's target to every active route (PROVISIONING and RUNNING, all health statuses) and combine two eviction reasons inside ``RouteExecutor.cleanup_routes_by_config``: 1. Orphan revision: ``route.revision_id`` is not the endpoint's current or deploying revision. Skipped when the endpoint has no revisions known yet (transient bootstrap state) so freshly created routes are not wiped. 2. Existing scaling-group health policy: ``route.health_status`` is listed in the resource group's ``cleanup_target_statuses``. A route matched by either reason is marked for termination as before.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR expands route eviction to also terminate “orphan-revision” active routes (e.g., leftovers from preempted rollouts) while keeping existing scaling-group health-policy cleanup behavior.
Changes:
- Broaden
RouteEvictionHandlertargets to all active routes (PROVISIONING/RUNNING) and all health statuses. - Add orphan-revision eviction logic to
RouteExecutor.cleanup_routes_by_config(combined with existing health-policy check). - Extend unit tests and test utilities to cover orphan-revision scenarios.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/manager/sokovan/deployment/route/executor/test_route_executor.py | Adds unit tests for orphan-revision eviction and updates existing tests to set revision IDs explicitly. |
| tests/unit/manager/sokovan/deployment/route/executor/conftest.py | Extends test route factory to accept a specific revision_id for deterministic scenarios. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/route_eviction.py | Updates handler scope/description and expands target statuses to cover all active routes. |
| src/ai/backend/manager/sokovan/deployment/route/executor.py | Implements orphan-revision eviction and combines it with health-policy eviction in cleanup eligibility. |
| changes/11370.fix.md | Documents the behavior change in the changelog. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with recorder.phase("identify_cleanup_target"): | ||
| with recorder.step("check_orphan_revision"): | ||
| valid_revisions = deployment_valid_revisions.get(route.deployment_id, set()) | ||
| if valid_revisions and route.revision_id not in valid_revisions: | ||
| return True | ||
|
|
||
| with recorder.step("check_cleanup_eligibility"): | ||
| cleanup_targets = deployment_cleanup_config.get(route.deployment_id, set()) | ||
| return route.health_status in cleanup_targets |
There was a problem hiding this comment.
The scaling-group health-policy eviction is now applied to all routes passed into cleanup, including PROVISIONING. This changes prior semantics (and the handler docstring says health-based eviction is for RUNNING routes). If health-based eviction should remain RUNNING-only, gate the health-policy check on route.status == RouteStatus.RUNNING (or an explicit allowed lifecycle set) so the new orphan-revision expansion doesn’t inadvertently evict provisioning routes.
| with recorder.step("check_cleanup_eligibility"): | ||
| cleanup_targets = deployment_cleanup_config.get(route.deployment_id, set()) | ||
| return route.health_status in cleanup_targets |
There was a problem hiding this comment.
The broadened handler scope + combined eviction logic introduces a new edge case: a PROVISIONING route whose health_status is in cleanup_target_statuses. If the intended behavior is RUNNING-only for health-policy eviction, add a unit test to assert that such a PROVISIONING route is not evicted (and conversely, if eviction is intended, document it explicitly and test it).
| ``revision_id`` belongs to neither the endpoint's ``current_revision`` | ||
| nor its ``deploying_revision``. This catches leftovers from a |
There was a problem hiding this comment.
The docstring references current_revision / deploying_revision, but the executor logic and tests use current_revision_id / deploying_revision_id. Please align the docstring attribute names with the actual model fields to avoid confusion for maintainers.
| ``revision_id`` belongs to neither the endpoint's ``current_revision`` | |
| nor its ``deploying_revision``. This catches leftovers from a | |
| ``revision_id`` belongs to neither the endpoint's ``current_revision_id`` | |
| nor its ``deploying_revision_id``. This catches leftovers from a |
| @@ -0,0 +1 @@ | |||
| Evict orphan-revision routes (active routes whose `revision_id` is neither the endpoint's `current_revision` nor its `deploying_revision`) through `RouteEvictionHandler` alongside the existing scaling-group health-policy eviction, so a preempted rollout no longer leaves behind PROVISIONING / RUNNING routes pointing at a stale revision. | |||
There was a problem hiding this comment.
The changelog refers to endpoint fields as current_revision / deploying_revision, while the code uses current_revision_id / deploying_revision_id. Consider updating the changelog to match the actual field names (or clarify that these refer to revision IDs) to prevent ambiguity in release notes.
| Evict orphan-revision routes (active routes whose `revision_id` is neither the endpoint's `current_revision` nor its `deploying_revision`) through `RouteEvictionHandler` alongside the existing scaling-group health-policy eviction, so a preempted rollout no longer leaves behind PROVISIONING / RUNNING routes pointing at a stale revision. | |
| Evict orphan-revision routes (active routes whose `revision_id` is neither the endpoint's `current_revision_id` nor its `deploying_revision_id`) through `RouteEvictionHandler` alongside the existing scaling-group health-policy eviction, so a preempted rollout no longer leaves behind PROVISIONING / RUNNING routes pointing at a stale revision. |
Summary
RouteEvictionHandlerpreviously fired only on RUNNING routes whose health was flagged by the scaling group'scleanup_target_statuses(default UNHEALTHY). Active routes whoserevision_idmatched neither the endpoint'scurrent_revisionnor itsdeploying_revision(e.g. leftovers from a preempted rollout) were never cleaned up.lifecycle=[PROVISIONING, RUNNING], all health statuses) and combine two eviction reasons insideRouteExecutor.cleanup_routes_by_config:route.revision_idis not the endpoint's current or deploying revision. Skipped when the endpoint has no revisions known yet (bootstrap state) so freshly created routes are not wiped.health_statusis in the resource group'scleanup_target_statuses.Test plan
test_orphan_revision_route_marked_for_cleanup— preempted leftover gets evicted regardless of healthtest_provisioning_route_for_deploying_revision_kept— active rollout PROVISIONING route is kepttest_orphan_check_skipped_when_no_known_revisions— bootstrap endpoint without current/deploying never wipes routesRE-001 / RE-002 / RE-003continue to pass (now explicitly stubcurrent_revision_id)pants test --changed-since=origin/main --changed-dependents=transitive(all pass)pants fmt / lint / checkcleanNote
from_statusis recorded as the first lifecycle intarget.lifecycle(PROVISIONING after this change). Cosmetic limitation — the actual UPDATE is still gated byRouteConditions.by_lifecycle_statuses(target.lifecycle)so routes only transition out of their real prior status. Worth a follow-up if richer history matters.set_deploying_revisionto override an active deploying revision) — orphan leftovers will be picked up by this handler in the very next route-handler tick.