docs: Address PR #381 review feedback #454

Open
rubambiza wants to merge 3 commits into llm-d-incubation:main from rubambiza:docs/benchmark-review-followup

Conversation

@rubambiza
Collaborator

Summary

Follow-up to #381, addressing review feedback from Mike and Ansu on the benchmarking scenarios doc.

  • Rename lukewarm to "Cold Start (with launcher)": the path is cold, not warm. Metric renamed from T_luke_warm to T_cold_launcher.
  • Split Cold Start into three variants: Cold Start (no FMA), Cold Start (FMA M2, planned), Cold Start (with launcher). M2 benchmarking is flagged as planned, pending a stable M3 harness.
  • Add constituent duration metrics table: T_launcher_schedule, T_launcher_startup, T_dpc_react, T_instance_ready as planned sub-metrics with observability sources.
  • Add L2 to Resource Scaling and Stress Test: TTFT cost per requester is low relative to actuation time, and the data is useful at scale.
  • Drop LPC attribution from Warm Start: DPC does not care whether the pre-existing launcher was created by LPC or by a prior DPC reconciliation.
  • Fix node-level language: launchers are on Nodes, not GPUs (Warm Start and Hot Start descriptions).
  • Remove unused L1+L3 legend entry; expand L1+L2+L3 description.
  • Fix Phase 2 T_launcher scope: applies to both warm and cold start with launcher paths.
  • Remove obsolete naming note (no longer needed after rename).

Test plan

  • Verify markdown tables render correctly
  • Confirm all 5 actuation path columns match between Paths table and Matrix
  • Confirm metric names are consistent across definitions, L1 legend, and Integration Phases
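
The table checks in this test plan can be partly automated. The following is a minimal sketch, assuming plain pipe-delimited Markdown tables with no escaped `|` inside cells. The sample tables, their cell values, and the assumption that the Paths table lists one path per row in its first column are hypothetical placeholders, not the contents of benchmark.md.

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-delimited Markdown table row into trimmed cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(cells: list[str]) -> bool:
    """True for the | --- | :---: | row that follows a table header."""
    return all(cell and set(cell) <= set("-: ") and "-" in cell for cell in cells)


def find_tables(markdown: str) -> list[list[list[str]]]:
    """Return each table as a list of rows (header first, separator row dropped)."""
    tables, current = [], []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            cells = split_row(line)
            if not is_separator(cells):
                current.append(cells)
        elif current:
            # A non-table line ends the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


def ragged_rows(table: list[list[str]]) -> list[list[str]]:
    """Rows whose cell count differs from the header; these render badly."""
    width = len(table[0])
    return [row for row in table[1:] if len(row) != width]


# Hypothetical stand-ins for the Paths table and the Matrix.
SAMPLE = """
| Path | Description |
| ---- | ----------- |
| Cold Start (no FMA) | placeholder |
| Cold Start (with launcher) | placeholder |
| Warm Start | placeholder |
| Hot Start | placeholder |

| Scenario | Cold Start (no FMA) | Cold Start (with launcher) | Warm Start | Hot Start |
| -------- | ------ | ------ | ------ | ------ |
| Resource Scaling | placeholder | placeholder | placeholder | placeholder |
"""

tables = find_tables(SAMPLE)
paths_table, matrix = tables
path_names = [row[0] for row in paths_table[1:]]  # first column of the Paths table
matrix_columns = matrix[0][1:]                    # Matrix header minus the row label

print("paths match matrix columns:", path_names == matrix_columns)
print("ragged rows in matrix:", ragged_rows(matrix))
```

A check like this only catches structural drift (column counts and path names); whether the tables render as intended still needs a visual pass.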

…th naming

Rename lukewarm start to "Cold Start (with launcher)" per reviewer
feedback that the path is cold, not warm. Split Cold Start into three
variants: no FMA, FMA M2 (planned), and with launcher. Rename
T_luke_warm metric to T_cold_launcher. Add constituent duration metrics
table (T_launcher_schedule, T_launcher_startup, T_dpc_react,
T_instance_ready) as planned sub-metrics. Remove obsolete naming note.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Summary of changes:
- Add Cold Start (FMA M2) to Purpose bullet list so it matches the
  5-column matrix (was "four" conditions, now "different" conditions)
- Clarify T_actuation: non-FMA and M2 cold starts have no FMA-specific
  sub-components
- Drop LPC attribution from Warm Start description -- DPC does not care
  whether the pre-existing launcher was created by LPC or by a prior
  DPC reconciliation
- Replace "on the correct/assigned GPU" with node-level language in
  Warm Start and Hot Start (launchers are on Nodes, not GPUs)
- Add L2 to Resource Scaling and Stress Test (L1+L3 -> L1+L2+L3) since
  TTFT cost is low relative to actuation and the data is useful at scale
- Remove unused L1+L3 legend entry from matrix; expand L1+L2+L3
  description to show how it builds on L1+L2
- Fix Phase 2 T_launcher scope: applies to both warm and cold start
  with launcher, not just warm

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Comment thread inference_server/benchmark/benchmark.md Outdated
-- **Cold start**: creating a new vLLM instance without using a launcher
-- **Luke warm start**: DPC creates a new launcher pod, then the launcher creates a new vLLM instance
+- **Cold start without FMA**: creating a new vLLM instance without using a launcher
+- **Cold start (FMA M2)**: DPC creates a standalone server-providing pod directly (planned)
Collaborator

What does (planned) mean here?

Collaborator

Why is it that we even need to mention M2 in these docs? I doubt we will ever deploy FMA M2...

Comment thread inference_server/benchmark/benchmark.md Outdated
**Metric definitions:**

-- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving. Spans different sub-components depending on the actuation path: hot start (T_wake), warm start (T_launcher), or luke warm start (T_luke_warm).
+- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving. For FMA paths, spans different sub-components depending on the actuation path: hot start (T_wake), warm start (T_launcher), or cold start with launcher (T_cold_launcher). For non-FMA and M2 cold starts, T_actuation is measured directly with no FMA-specific sub-components.
Collaborator

This adds more confusion for new readers. We could remove "and M2 cold starts".
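
For reference, the T_actuation span defined above (requester pod creation to requester pod readiness) can be approximated from pod status alone. This is a minimal sketch, assuming the pod JSON comes from something like `kubectl get pod <name> -o json` and that the `Ready` condition's `lastTransitionTime` is an acceptable stand-in for the `/ready` probe passing. The function names and the sample pod are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_k8s_time(value: str) -> datetime:
    """Parse a pod-status timestamp such as 2026-04-24T10:00:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def condition_time(pod: dict, cond_type: str) -> Optional[datetime]:
    """Return when the given condition last became True, or None if it is not True."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond["type"] == cond_type and cond["status"] == "True":
            return parse_k8s_time(cond["lastTransitionTime"])
    return None


def t_actuation_seconds(requester_pod: dict) -> Optional[float]:
    """Seconds from requester pod creation to its Ready condition, if Ready."""
    created = parse_k8s_time(requester_pod["metadata"]["creationTimestamp"])
    ready = condition_time(requester_pod, "Ready")
    if ready is None:
        return None  # pod is not Ready yet, so T_actuation is undefined
    return (ready - created).total_seconds()


# Hypothetical requester pod, trimmed to the fields used above.
sample_pod = {
    "metadata": {"creationTimestamp": "2026-04-24T10:00:00Z"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True",
             "lastTransitionTime": "2026-04-24T10:00:01Z"},
            {"type": "Ready", "status": "True",
             "lastTransitionTime": "2026-04-24T10:00:42Z"},
        ]
    },
}

print(t_actuation_seconds(sample_pod))
```

Pod status timestamps are typically serialized at one-second resolution, so a measurement taken this way is coarse for short paths such as hot start.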

 The goal is to quantify and compare how quickly a model-serving duo (server-requesting
-and server-providing pods) becomes available under four different actuation conditions
+and server-providing pods) becomes available under different actuation conditions
 in order of decreasing latency:
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

I think that the order in which the cases actually appear is fine, but I suspect that it is not equal to "decreasing latency". The ordering is: first, code complexity, and second, runtime path length (which I expect will correlate with latency).

Drop Cold Start (FMA M2) from the actuation paths table and matrix.
M2 is acknowledged as a distinct path via a note under the paths
table but excluded from the benchmarking focus. This simplifies the
matrix to 4 columns (no FMA, with launcher, warm, hot) and removes
all "(planned)" annotations.

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
 - **Hit_rate**: Fraction of server-requesting Pods that get satisfied by waking a sleeping vLLM instance.
-- **T_luke_warm**: Time from the DPC requesting launcher pod creation to the new vLLM instance reporting healthy. Covers the full luke warm start span: launcher pod scheduling, launcher readiness, DPC reconciliation, and vLLM instance creation. Measured end-to-end because the boundary between launcher readiness and instance creation is not directly observable from outside the DPC.
 - **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading. Applies to the warm start path, where a launcher pod already exists.
+- **T_cold_launcher**: Time from the DPC launcher pod creation to the new vLLM instance reporting healthy. Covers the full cold start (with launcher) span: launcher pod scheduling, launcher startup, and vLLM instance creation.
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

Is this trying to say that T_cold_launcher is T_actuation but restricted to cold start with launcher scenarios? In other words, T_actuation is (a) the time from (1) creation of the server-requesting Pod to (2) requester Pod readiness but (b) is only measured for the cold start with launcher cases?

If so, then the text currently here is misleading: it suggests a bit less of a span to me. It is also confusing because it says "full ... span".

If not, then this is a different kind of refinement than T_wake: this one covers less of the full path, whereas T_wake covers the full path but is restricted by actuation case.

| ------ | ---------- | -------------- |
| **T_launcher_schedule** | Launcher pod `creationTimestamp` to `PodScheduled` condition `lastTransitionTime` | Kube pod status |
| **T_launcher_startup** | Launcher pod `PodScheduled` to `Ready` condition `lastTransitionTime` | Kube pod status |
| **T_dpc_react** | Launcher pod `Ready` to DPC issuing `CreateNamedInstance` | DPC logs (V5: "Creating new vLLM instance") |
Collaborator

What is "V5"?

| ------ | ---------- | -------------- |
| **T_launcher_schedule** | Launcher pod `creationTimestamp` to `PodScheduled` condition `lastTransitionTime` | Kube pod status |
| **T_launcher_startup** | Launcher pod `PodScheduled` to `Ready` condition `lastTransitionTime` | Kube pod status |
| **T_dpc_react** | Launcher pod `Ready` to DPC issuing `CreateNamedInstance` | DPC logs (V5: "Creating new vLLM instance") |
Collaborator

@MikeSpreitzer MikeSpreitzer Apr 24, 2026

To remove the difficulties of parsing the controller log, the controller could produce a Prometheus histogram of this duration. Depending on the level of correlation with other measurements intended, the fact that this is inherently an aggregate may or may not be a problem. If it is a problem then we could consider supporting distributed tracing.

Same for T_instance_ready
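
Unlike T_dpc_react and T_instance_ready, the two sub-metrics sourced from Kube pod status need no log parsing. This is a minimal sketch of deriving T_launcher_schedule and T_launcher_startup, assuming the launcher pod's JSON is available (for example from `kubectl get pod <launcher> -o json`). The helper names and the sample document are hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a pod-status timestamp such as 2026-04-24T10:00:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def transition_time(pod: dict, cond_type: str) -> datetime:
    """lastTransitionTime of a condition that is currently True."""
    for cond in pod["status"]["conditions"]:
        if cond["type"] == cond_type and cond["status"] == "True":
            return parse_k8s_time(cond["lastTransitionTime"])
    raise ValueError(f"condition {cond_type} is not True on this pod")


def launcher_pod_durations(pod: dict) -> dict:
    """Compute the two pod-status sub-metrics from the table, in seconds."""
    created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
    scheduled = transition_time(pod, "PodScheduled")
    ready = transition_time(pod, "Ready")
    return {
        # creationTimestamp -> PodScheduled
        "T_launcher_schedule": (scheduled - created).total_seconds(),
        # PodScheduled -> Ready
        "T_launcher_startup": (ready - scheduled).total_seconds(),
    }


# Hypothetical launcher pod, trimmed to the fields used above.
LAUNCHER_POD_JSON = """
{
  "metadata": {"creationTimestamp": "2026-04-24T10:00:00Z"},
  "status": {"conditions": [
    {"type": "PodScheduled", "status": "True", "lastTransitionTime": "2026-04-24T10:00:02Z"},
    {"type": "Ready", "status": "True", "lastTransitionTime": "2026-04-24T10:00:17Z"}
  ]}
}
"""

durations = launcher_pod_durations(json.loads(LAUNCHER_POD_JSON))
print(durations)
```

These are per-pod values, so they can be correlated with a specific requester, which an aggregate histogram would not allow.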


Relationships:
- T_cold_launcher ≈ T_launcher_schedule + T_launcher_startup + T_dpc_react + T_instance_ready
- T_launcher ≈ T_dpc_react + T_instance_ready (launcher already Ready)
Collaborator

In the warm start actuation path, the right starting gun is not "Launcher pod Ready". As noted, the launcher Pod is ready before the server-requesting Pod is created.
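
To illustrate the concern numerically: one possible reading, which is an assumption and not something settled in this thread, is to start the T_dpc_react clock at the later of "launcher pod Ready" and "server-requesting Pod created". The sketch below uses hypothetical timestamps and function names to show how that choice behaves on both paths.

```python
from datetime import datetime, timezone


def ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2026-04-24T10:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def t_dpc_react(launcher_ready: datetime,
                requester_created: datetime,
                create_issued: datetime) -> float:
    """Seconds from the assumed starting gun to the DPC issuing the create call.

    Assumption: the clock starts at whichever happens later, the launcher
    becoming Ready or the server-requesting Pod being created.
    """
    start = max(launcher_ready, requester_created)
    return (create_issued - start).total_seconds()


# Cold start with launcher: the launcher becomes Ready after the requester exists.
cold = t_dpc_react(
    launcher_ready=ts("2026-04-24T10:00:17Z"),
    requester_created=ts("2026-04-24T10:00:00Z"),
    create_issued=ts("2026-04-24T10:00:18Z"),
)

# Warm start: the launcher has been Ready for an hour before the requester appears.
warm = t_dpc_react(
    launcher_ready=ts("2026-04-24T09:00:00Z"),
    requester_created=ts("2026-04-24T10:00:00Z"),
    create_issued=ts("2026-04-24T10:00:01Z"),
)

# Measuring warm start from "launcher Ready" instead would count the idle hour.
naive_warm = (ts("2026-04-24T10:00:01Z") - ts("2026-04-24T09:00:00Z")).total_seconds()

print(cold, warm, naive_warm)
```

Under this assumption both paths report a one-second reaction time, while the "launcher Ready" starting point inflates the warm start figure by however long the launcher sat idle.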

Collaborator

@MikeSpreitzer MikeSpreitzer left a comment

I left some individual comments.
