Skip to content

fix: evict orphan-revision routes through RouteEvictionHandler#11370

Merged
HyeockJinKim merged 2 commits intomainfrom
fix/evict-orphan-revision-routes
Apr 28, 2026
Merged

fix: evict orphan-revision routes through RouteEvictionHandler#11370
HyeockJinKim merged 2 commits intomainfrom
fix/evict-orphan-revision-routes

Conversation

@HyeockJinKim
Copy link
Copy Markdown
Collaborator

Summary

  • RouteEvictionHandler previously fired only on RUNNING routes whose health was flagged by the scaling group's cleanup_target_statuses (default UNHEALTHY). Active routes whose revision_id matched neither the endpoint's current_revision nor its deploying_revision (e.g. leftovers from a preempted rollout) were never cleaned up.
  • Broaden the handler's target to every active route (lifecycle=[PROVISIONING, RUNNING], all health statuses) and combine two eviction reasons inside RouteExecutor.cleanup_routes_by_config:
    1. Orphan revision: route.revision_id is not the endpoint's current or deploying revision. Skipped when the endpoint has no revisions known yet (bootstrap state) so freshly created routes are not wiped.
    2. Existing scaling-group health policy: route's health_status is in the resource group's cleanup_target_statuses.
  • A route matched by either reason is marked TERMINATING — same outgoing behaviour as before.

Test plan

  • New unit tests:
    • test_orphan_revision_route_marked_for_cleanup — preempted leftover gets evicted regardless of health
    • test_provisioning_route_for_deploying_revision_kept — active rollout PROVISIONING route is kept
    • test_orphan_check_skipped_when_no_known_revisions — bootstrap endpoint without current/deploying never wipes routes
  • Existing RE-001 / RE-002 / RE-003 continue to pass (now explicitly stub current_revision_id)
  • pants test --changed-since=origin/main --changed-dependents=transitive (all pass)
  • pants fmt / lint / check clean

Note

  • History from_status is recorded as the first lifecycle in target.lifecycle (PROVISIONING after this change). Cosmetic limitation — the actual UPDATE is still gated by RouteConditions.by_lifecycle_statuses(target.lifecycle) so routes only transition out of their real prior status. Worth a follow-up if richer history matters.
  • Sets up the ground for the next PR (issue Missing kernel creation parameter in API Documentation #2: allow set_deploying_revision to override an active deploying revision) — orphan leftovers will be picked up by this handler in the very next route-handler tick.

The route eviction handler used to fire only on RUNNING routes whose
health status the scaling group flagged for cleanup (typically
UNHEALTHY). Active routes whose ``revision_id`` matches neither the
endpoint's ``current_revision`` nor its ``deploying_revision`` were
left alone, so a preempted rollout would leave behind PROVISIONING /
RUNNING routes pointing at a stale revision until something else
happened to them.

Broaden the handler's target to every active route (PROVISIONING and
RUNNING, all health statuses) and combine two eviction reasons inside
``RouteExecutor.cleanup_routes_by_config``:

1. Orphan revision: ``route.revision_id`` is not the endpoint's
   current or deploying revision. Skipped when the endpoint has no
   revisions known yet (transient bootstrap state) so freshly created
   routes are not wiped.
2. Existing scaling-group health policy: ``route.health_status`` is
   listed in the resource group's ``cleanup_target_statuses``.

A route matched by either reason is marked for termination as before.
Copilot AI review requested due to automatic review settings April 28, 2026 04:43
@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component labels Apr 28, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR expands route eviction to also terminate “orphan-revision” active routes (e.g., leftovers from preempted rollouts) while keeping existing scaling-group health-policy cleanup behavior.

Changes:

  • Broaden RouteEvictionHandler targets to all active routes (PROVISIONING/RUNNING) and all health statuses.
  • Add orphan-revision eviction logic to RouteExecutor.cleanup_routes_by_config (combined with existing health-policy check).
  • Extend unit tests and test utilities to cover orphan-revision scenarios.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/unit/manager/sokovan/deployment/route/executor/test_route_executor.py Adds unit tests for orphan-revision eviction and updates existing tests to set revision IDs explicitly.
tests/unit/manager/sokovan/deployment/route/executor/conftest.py Extends test route factory to accept a specific revision_id for deterministic scenarios.
src/ai/backend/manager/sokovan/deployment/route/handlers/route_eviction.py Updates handler scope/description and expands target statuses to cover all active routes.
src/ai/backend/manager/sokovan/deployment/route/executor.py Implements orphan-revision eviction and combines it with health-policy eviction in cleanup eligibility.
changes/11370.fix.md Documents the behavior change in the changelog.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 758 to 766
with recorder.phase("identify_cleanup_target"):
with recorder.step("check_orphan_revision"):
valid_revisions = deployment_valid_revisions.get(route.deployment_id, set())
if valid_revisions and route.revision_id not in valid_revisions:
return True

with recorder.step("check_cleanup_eligibility"):
cleanup_targets = deployment_cleanup_config.get(route.deployment_id, set())
return route.health_status in cleanup_targets
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scaling-group health-policy eviction is now applied to all routes passed into cleanup, including PROVISIONING. This changes prior semantics (and the handler docstring says health-based eviction is for RUNNING routes). If health-based eviction should remain RUNNING-only, gate the health-policy check on route.status == RouteStatus.RUNNING (or an explicit allowed lifecycle set) so the new orphan-revision expansion doesn’t inadvertently evict provisioning routes.

Copilot uses AI. Check for mistakes.
Comment on lines 764 to 766
with recorder.step("check_cleanup_eligibility"):
cleanup_targets = deployment_cleanup_config.get(route.deployment_id, set())
return route.health_status in cleanup_targets
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broadened handler scope + combined eviction logic introduces a new edge case: a PROVISIONING route whose health_status is in cleanup_target_statuses. If the intended behavior is RUNNING-only for health-policy eviction, add a unit test to assert that such a PROVISIONING route is not evicted (and conversely, if eviction is intended, document it explicitly and test it).

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +36
``revision_id`` belongs to neither the endpoint's ``current_revision``
nor its ``deploying_revision``. This catches leftovers from a
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring references current_revision / deploying_revision, but the executor logic and tests use current_revision_id / deploying_revision_id. Please align the docstring attribute names with the actual model fields to avoid confusion for maintainers.

Suggested change
``revision_id`` belongs to neither the endpoint's ``current_revision``
nor its ``deploying_revision``. This catches leftovers from a
``revision_id`` belongs to neither the endpoint's ``current_revision_id``
nor its ``deploying_revision_id``. This catches leftovers from a

Copilot uses AI. Check for mistakes.
Comment thread changes/11370.fix.md
@@ -0,0 +1 @@
Evict orphan-revision routes (active routes whose `revision_id` is neither the endpoint's `current_revision` nor its `deploying_revision`) through `RouteEvictionHandler` alongside the existing scaling-group health-policy eviction, so a preempted rollout no longer leaves behind PROVISIONING / RUNNING routes pointing at a stale revision.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog refers to endpoint fields as current_revision / deploying_revision, while the code uses current_revision_id / deploying_revision_id. Consider updating the changelog to match the actual field names (or clarify that these refer to revision IDs) to prevent ambiguity in release notes.

Suggested change
Evict orphan-revision routes (active routes whose `revision_id` is neither the endpoint's `current_revision` nor its `deploying_revision`) through `RouteEvictionHandler` alongside the existing scaling-group health-policy eviction, so a preempted rollout no longer leaves behind PROVISIONING / RUNNING routes pointing at a stale revision.
Evict orphan-revision routes (active routes whose `revision_id` is neither the endpoint's `current_revision_id` nor its `deploying_revision_id`) through `RouteEvictionHandler` alongside the existing scaling-group health-policy eviction, so a preempted rollout no longer leaves behind PROVISIONING / RUNNING routes pointing at a stale revision.

Copilot uses AI. Check for mistakes.
@HyeockJinKim HyeockJinKim merged commit 7c9b827 into main Apr 28, 2026
33 checks passed
@HyeockJinKim HyeockJinKim deleted the fix/evict-orphan-revision-routes branch April 28, 2026 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:manager Related to Manager component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants