docs: Address PR #381 review feedback #454
rubambiza wants to merge 3 commits into llm-d-incubation:main
…th naming

Rename lukewarm start to "Cold Start (with launcher)" per reviewer feedback that the path is cold, not warm. Split Cold Start into three variants: no FMA, FMA M2 (planned), and with launcher. Rename T_luke_warm metric to T_cold_launcher. Add constituent duration metrics table (T_launcher_schedule, T_launcher_startup, T_dpc_react, T_instance_ready) as planned sub-metrics. Remove obsolete naming note.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
Summary of changes:
- Add Cold Start (FMA M2) to Purpose bullet list so it matches the 5-column matrix (was "four" conditions, now "different" conditions)
- Clarify T_actuation: non-FMA and M2 cold starts have no FMA-specific sub-components
- Drop LPC attribution from Warm Start description -- DPC does not care whether the pre-existing launcher was created by LPC or by a prior DPC reconciliation
- Replace "on the correct/assigned GPU" with node-level language in Warm Start and Hot Start (launchers are on Nodes, not GPUs)
- Add L2 to Resource Scaling and Stress Test (L1+L3 -> L1+L2+L3) since TTFT cost is low relative to actuation and the data is useful at scale
- Remove unused L1+L3 legend entry from matrix; expand L1+L2+L3 description to show how it builds on L1+L2
- Fix Phase 2 T_launcher scope: applies to both warm and cold start with launcher, not just warm

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
- **Cold start**: creating a new vLLM instance without using a launcher
- **Luke warm start**: DPC creates a new launcher pod, then the launcher creates a new vLLM instance
- **Cold start without FMA**: creating a new vLLM instance without using a launcher
- **Cold start (FMA M2)**: DPC creates a standalone server-providing pod directly (planned)
What does (planned) mean here?
Why is it that we even need to mention M2 in these docs? I doubt we will ever deploy FMA M2...
**Metric definitions:**

- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving. Spans different sub-components depending on the actuation path: hot start (T_wake), warm start (T_launcher), or luke warm start (T_luke_warm).
- **T_actuation**: Time from requester pod creation (ReplicaSet scale-up) to requester pod readiness (`/ready` probe passes), which implies the DPC has bound the requester to a server-providing pod and the vLLM instance is serving. For FMA paths, spans different sub-components depending on the actuation path: hot start (T_wake), warm start (T_launcher), or cold start with launcher (T_cold_launcher). For non-FMA and M2 cold starts, T_actuation is measured directly with no FMA-specific sub-components.
This adds more confusion for new readers. We could remove "and M2 cold starts".
The goal is to quantify and compare how quickly a model-serving duo (server-requesting
and server-providing pods) becomes available under four different actuation conditions
and server-providing pods) becomes available under different actuation conditions
in order of decreasing latency:
I think that the order in which the cases actually appear is fine, but I suspect that it is not equal to "decreasing latency". The ordering is: first, code complexity, and second, runtime path length (which I expect will correlate with latency).
Drop Cold Start (FMA M2) from the actuation paths table and matrix. M2 is acknowledged as a distinct path via a note under the paths table but excluded from the benchmarking focus. This simplifies the matrix to 4 columns (no FMA, with launcher, warm, hot) and removes all "(planned)" annotations.

Assisted-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gloire Rubambiza <gloire@ibm.com>
- **Hit_rate**: Fraction of server-requesting Pods that get satisfied by waking a sleeping vLLM instance.
- **T_luke_warm**: Time from the DPC requesting launcher pod creation to the new vLLM instance reporting healthy. Covers the full luke warm start span: launcher pod scheduling, launcher readiness, DPC reconciliation, and vLLM instance creation. Measured end-to-end because the boundary between launcher readiness and instance creation is not directly observable from outside the DPC.
- **T_launcher**: Time from the launcher receiving a create request to the new vLLM instance reporting healthy. Includes the benefit of vLLM module preloading. Applies to the warm start path, where a launcher pod already exists.
- **T_cold_launcher**: Time from the DPC launcher pod creation to the new vLLM instance reporting healthy. Covers the full cold start (with launcher) span: launcher pod scheduling, launcher startup, and vLLM instance creation.
Is this trying to say that T_cold_launcher is T_actuation but restricted to cold start with launcher scenarios? In other words, T_actuation is (a) the time from (1) creation of the server-requesting Pod to (2) requester Pod readiness but (b) is only measured for the cold start with launcher cases?
If so, then the text currently here is misleading: it suggests a bit less of a span to me. It is also confusing because it says "full ... span".
If not, then this is a different kind of refinement than T_wake: this one covers less of the full path but T_wake is the full path but restricted by actuation case.
| ------ | ---------- | -------------- |
| **T_launcher_schedule** | Launcher pod `creationTimestamp` to `PodScheduled` condition `lastTransitionTime` | Kube pod status |
| **T_launcher_startup** | Launcher pod `PodScheduled` to `Ready` condition `lastTransitionTime` | Kube pod status |
| **T_dpc_react** | Launcher pod `Ready` to DPC issuing `CreateNamedInstance` | DPC logs (V5: "Creating new vLLM instance") |
To remove the difficulties of parsing the controller log, the controller could produce a Prometheus histogram of this duration. Depending on the level of correlation with other measurements intended, the fact that this is inherently an aggregate may or may not be a problem. If it is a problem then we could consider supporting distributed tracing.
Same for T_instance_ready
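For reference, the two rows in the table above that are sourced from Kube pod status can be derived from a pod object as returned by `kubectl get pod -o json`. The following is a minimal sketch, assuming the standard `metadata.creationTimestamp` and `status.conditions[].lastTransitionTime` fields; the helper names and the sample pod are hypothetical and not part of the PR.

```python
from datetime import datetime, timezone

def parse_ts(ts):
    # Kubernetes serializes these timestamps as RFC 3339 UTC strings,
    # e.g. "2025-01-01T12:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def condition_time(pod, cond_type):
    # Return lastTransitionTime of the named condition if it is currently True.
    for cond in pod["status"].get("conditions", []):
        if cond["type"] == cond_type and cond["status"] == "True":
            return parse_ts(cond["lastTransitionTime"])
    return None

def launcher_durations(pod):
    # Compute the two pod-status-derived sub-metrics, in seconds.
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    scheduled = condition_time(pod, "PodScheduled")
    ready = condition_time(pod, "Ready")
    if scheduled is None or ready is None:
        return None  # pod is not yet scheduled or not yet Ready
    return {
        "T_launcher_schedule": (scheduled - created).total_seconds(),
        "T_launcher_startup": (ready - scheduled).total_seconds(),
    }

# Hypothetical launcher pod, trimmed to the fields used above.
sample_pod = {
    "metadata": {"creationTimestamp": "2025-01-01T12:00:00Z"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True",
             "lastTransitionTime": "2025-01-01T12:00:02Z"},
            {"type": "Ready", "status": "True",
             "lastTransitionTime": "2025-01-01T12:00:17Z"},
        ]
    },
}

durations = launcher_durations(sample_pod)
```

Because these timestamps are serialized at one-second granularity, sub-second differences in these two metrics would not be resolvable this way.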
Relationships:
- T_cold_launcher ≈ T_launcher_schedule + T_launcher_startup + T_dpc_react + T_instance_ready
- T_launcher ≈ T_dpc_react + T_instance_ready (launcher already Ready)
In the warm start actuation path, the right starting gun is not "Launcher pod Ready". As noted, the launcher Pod is ready before the server-requesting Pod is created.
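One way to sanity-check the "≈" relationships per run is to compare an end-to-end measurement against the sum of its constituents and look at the residual. This is only a sketch: the function name and all numbers below are hypothetical, and it covers the T_cold_launcher decomposition as written in the diff, not the warm start case questioned above.

```python
def decomposition_residual(total, parts):
    # Residual between the end-to-end duration and the sum of its
    # constituent durations (all in seconds). A large residual suggests
    # a gap or overlap in how the constituents were measured.
    return total - sum(parts.values())

# Hypothetical single-run measurements, in seconds.
constituents = {
    "T_launcher_schedule": 2.0,
    "T_launcher_startup": 15.0,
    "T_dpc_react": 0.5,
    "T_instance_ready": 40.0,
}
t_cold_launcher = 58.0

residual = decomposition_residual(t_cold_launcher, constituents)
```

Tracking the residual across runs would show whether the constituents account for the whole span or whether some interval is not covered by any of them.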
MikeSpreitzer left a comment
I left some individual comments.
Summary
Follow-up to #381, addressing review feedback from Mike and Ansu on the benchmarking scenarios doc.
Test plan