
diag(proxy): plumb request_id and exception type into compression failure logs#303

Merged
chopratejas merged 3 commits into chopratejas:main from SwiftWing21:diag/compression-timeout-296
Apr 29, 2026

Conversation

@SwiftWing21
Contributor

Summary

Refs #296. Observability-only change to make the next bug report on compression-stage timeouts actually root-causable.

The reporter's evidence in #296 includes "Pipeline complete: ... saved 6100 tokens" appearing right before "Optimization failed:" — leading to the natural reading that the pipeline succeeded but its result was discarded. Under concurrent load that reading isn't necessarily true: the pipeline log lines carry no request_id, so they could be from a sibling request. We can't tell from the log alone.

This PR closes that gap without speculating on the root cause:

  1. pipeline.apply accepts request_id via kwargs and prefixes its two INFO log lines ("Pipeline: freezing first N/M ..." and "Pipeline complete: ...") with [request_id] when present. Pre-existing log sites that don't pass it stay unchanged (empty prefix).
  2. The four anthropic_pipeline.apply() call sites in the Anthropic handler now pass request_id through (three in handle_anthropic_messages, one in handle_anthropic_batch_create).
  3. The catch-all warning at anthropic.py:880 now includes the exception type and request_id: [{request_id}] Optimization failed: {type(e).__name__}: {e}. This targets the specific gap in #296 (Anthropic token-mode compression result discarded after pipeline completes; proxy times out and forwards original request): its evidence shows Optimization failed: with nothing after the colon, because str(asyncio.TimeoutError()) is empty.
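The empty-message behavior behind change (3) is easy to confirm in isolation. A minimal sketch of the old versus new warning text (format_failure is a hypothetical stand-in for the handler's f-string, not the actual code):

```python
import asyncio

def format_failure(request_id: str, exc: Exception) -> str:
    # Old format was f"Optimization failed: {exc}" -- empty for TimeoutError.
    # New format adds the [request_id] prefix and the exception type name.
    return f"[{request_id}] Optimization failed: {type(exc).__name__}: {exc}"

old = f"Optimization failed: {asyncio.TimeoutError()}"
new = format_failure("hr_1777396740_000001", asyncio.TimeoutError())

print(old)  # "Optimization failed: " -- nothing after the colon
print(new)  # "[hr_1777396740_000001] Optimization failed: TimeoutError: "
```

With the type name in the message, a report of this failure is immediately distinguishable from, say, a ValueError raised inside the pipeline.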

Validation in the wild

While testing locally I noticed my live ~/.headroom/logs/proxy.log already contained entries in exactly the new format:

[hr_1777396740_000001] Optimization failed: TimeoutError:

Turned out these were test-pollution artifacts from a separate finding (see follow-up comment) — but they confirm the new warning format works as intended end-to-end and would have made the original bug report immediately identifiable as a TimeoutError rather than a silent failure.

Tests

New tests/test_proxy_anthropic_compression_diagnostics.py adds two focused tests:

  • test_request_id_plumbed_to_pipeline_apply — verifies the handler plumbs request_id into pipeline.apply() via the lightweight _make_proxy_client() pattern + monkeypatched apply.
  • test_optimization_failure_logs_exception_type — verifies the warning includes the exception type when pipeline.apply raises asyncio.TimeoutError (whose str() is empty). Uses a dedicated _ListHandler because headroom.proxy has propagate=False, so caplog can't see the records via root propagation.
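The propagate=False wrinkle is reproducible with plain logging. A self-contained sketch of the handler-attachment pattern (the logger name comes from the description above; ListHandler is a hypothetical minimal version of the test helper):

```python
import logging

class ListHandler(logging.Handler):
    """Collect emitted records in a list so a test can assert on them."""
    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)

logger = logging.getLogger("headroom.proxy")
logger.propagate = False  # records never reach the root logger, so caplog misses them

handler = ListHandler()
logger.addHandler(handler)  # attach directly to the non-propagating logger
logger.warning("[hr_x_000001] Optimization failed: TimeoutError: ")
logger.removeHandler(handler)

print(handler.records[0].getMessage())
```

pytest's caplog works by attaching to the root logger, which is exactly the path propagate=False cuts off, hence the dedicated handler.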

Test plan

  • ruff check on modified files — clean
  • ruff format --check on modified files — clean
  • Full Anthropic handler suite passes — 41/41 across cache stability (17), diagnostics (2), stage timings (3), pre-upstream backpressure (20). No regressions.
  • CI: cargo fmt / clippy / Rust workspace tests (not run locally — Python-only change)

Scope

This is intentionally diagnostics-only. The underlying timeout still needs reproduction at 367k+ token transcripts. Acting on my informed guess (thread-pool starvation under anthropic_pre_upstream_concurrency=8 with several large concurrent compressions, since asyncio.to_thread uses the default executor and the work is GIL-heavy) would mean shipping a speculative fix without a repro. With these diagnostics shipped, the next bug report will tell us whether the failing request actually reached the pipeline — and that's the question root-causing #296 needs answered first.
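For background on that guess (not evidence for it): asyncio.to_thread funnels work into the event loop's shared default ThreadPoolExecutor, so several GIL-heavy compressions contend for both the pool and the interpreter lock. A runnable sketch, where cpu_heavy is a hypothetical stand-in for one compression pass:

```python
import asyncio
import os
import time

# Default executor size on recent CPython is roughly min(32, cpu_count + 4).
default_workers = min(32, (os.cpu_count() or 1) + 4)

def cpu_heavy(n: int) -> int:
    # Pure-Python loop: holds the GIL, unlike I/O-bound thread work.
    total = 0
    for i in range(n):
        total += i * i
    return total

async def main() -> None:
    # Eight "concurrent" compressions all share one default executor
    # and, being pure Python, largely serialize on the GIL anyway.
    start = time.perf_counter()
    results = await asyncio.gather(
        *(asyncio.to_thread(cpu_heavy, 200_000) for _ in range(8))
    )
    elapsed = time.perf_counter() - start
    print(f"{len(results)} tasks, default pool={default_workers}, {elapsed:.2f}s")

asyncio.run(main())
```

If the real compression work behaves like this, eight concurrent large requests would queue behind each other until one crosses the 30s deadline, which is consistent with (but does not prove) the #296 symptom.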

🤖 Generated with Claude Code

@SwiftWing21
Contributor Author

Side finding worth surfacing — test runs pollute the live proxy log

While verifying these diagnostics against my live ~/.headroom/logs/proxy.log, I found that every test run that imports headroom.proxy.server writes to the production log file. The mechanism:

headroom/proxy/helpers.py:189 calls _setup_file_logging() unconditionally at module import. That installs a RotatingFileHandler on the headroom root logger pointing at ~/.headroom/logs/proxy.log, regardless of whether we're inside pytest or running the real proxy.

Concretely: the diagnostics test in this PR (test_optimization_failure_logs_exception_type) deliberately raises TimeoutError to verify the new warning format. When run, it logs:

[hr_<timestamp>_000001] Optimization failed: TimeoutError:

…to the user's real proxy log. With the per-process counter starting at _000001 for each fresh _make_proxy_client() instance, these synthetic failures look like real first-request-of-session failures — and they were initially misleading me into thinking I was reproducing #296 in the wild.

A few possible fixes, in order of how scope-clean they are:

  1. Detect pytest at import — if "PYTEST_CURRENT_TEST" not in os.environ: around the _setup_file_logging() call. One-line, zero behavior change for the proxy.
  2. Move _setup_file_logging() into the create_app() / FastAPI lifespan path so it only runs when the app is actually being served.
  3. Make file logging opt-in via an env var (HEADROOM_FILE_LOGGING=1) or a ProxyConfig field.

I'd lean (1) for minimal scope, but (2) is the more "correct" fix architecturally.
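A hedged sketch of what option (2) could look like (the handler parameters and create_app wiring are assumptions for illustration, not the project's actual code):

```python
import logging
import logging.handlers

_FILE_LOGGING_CONFIGURED = False

def _setup_file_logging(log_path: str) -> None:
    """Attach a RotatingFileHandler to the headroom logger exactly once."""
    global _FILE_LOGGING_CONFIGURED
    if _FILE_LOGGING_CONFIGURED:
        return
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=10_000_000, backupCount=3  # sizes are placeholders
    )
    logging.getLogger("headroom").addHandler(handler)
    _FILE_LOGGING_CONFIGURED = True

def create_app(log_path: str):
    # File logging now happens only when the app is actually constructed,
    # so `import headroom.proxy.server` in a test stays side-effect free.
    _setup_file_logging(log_path)
    app = object()  # placeholder for building and returning the real app
    return app
```

The once-guard keeps repeated create_app() calls (e.g. multiple TestClient instances in one process) from stacking duplicate handlers.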

Question for the maintainers before I open a sibling PR: is this worth fixing, and if so which of the three approaches do you prefer? Happy to roll it into this PR if you'd like both shipped together, or keep them separate for reviewability.

🤖 Generated with Claude Code

@chopratejas
Owner

Thank you for fixing this - I agree that (2) is much better if you can push it in a follow-up PR

@chopratejas
Owner

Hey @SwiftWing21 - can you help fix the test failures?

SwiftWing21 and others added 2 commits April 28, 2026 20:59
…ilure logs

Issue chopratejas#296 reports compression appearing to complete successfully then
being discarded after a 30s timeout. The diagnostic gap that makes the
report hard to root-cause:

1. The pipeline's "Pipeline complete: ..." and "Pipeline: freezing
   first ..." log lines have no request_id, so under concurrent load
   the reporter cannot tell whether the success log is from the
   timing-out request or a sibling request.
2. The handler's catch-all "Optimization failed: {e}" passes only
   str(e), which is empty for asyncio.TimeoutError — the report shows
   "Optimization failed:" with nothing after the colon.

This change is observability-only:

- pipeline.apply now reads request_id from kwargs and prefixes its two
  INFO log lines with [request_id] when present.
- The four anthropic_pipeline.apply call sites in the Anthropic handler
  (3 in handle_anthropic_messages, 1 in handle_anthropic_batch_create)
  pass request_id through.
- The catch-all warning becomes
  "[{request_id}] Optimization failed: {type(e).__name__}: {e}" so
  TimeoutError is distinguishable from real exceptions in bug reports.

No behavior change. Adds two tests covering both diagnostics.

This is intentionally not a fix for chopratejas#296 — the underlying timeout still
needs reproduction at 367k+ token transcripts. With these diagnostics in
place, the next bug report will be able to confirm whether the failing
request actually reached the pipeline.

Refs chopratejas#296

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test

The previous _ListHandler approach attached a handler to the
headroom.proxy logger and worked locally on Python 3.14, but failed in
CI on Python 3.10-3.13 — the warning record never reached the handler.
Root cause is unclear (possibly cross-test logger state), but the
handler-attachment path is brittle for a single-warning assertion.

Replace it with unittest.mock.patch.object on the handler module's
logger.warning. This is invariant to logging hierarchy, propagation
flags, and per-test logger mutations — we directly observe the call
that the production code makes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@SwiftWing21 SwiftWing21 force-pushed the diag/compression-timeout-296 branch from 5cc8aa1 to 036e424 Compare April 29, 2026 04:00
@codecov

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@chopratejas chopratejas merged commit 1812695 into chopratejas:main Apr 29, 2026
18 checks passed
chopratejas pushed a commit that referenced this pull request Apr 29, 2026
Move _setup_file_logging() from module import to create_app() so
importing headroom.proxy.server in tests or library contexts no longer
silently attaches a RotatingFileHandler to the user's live proxy.log.

This was discovered via PR #303: running the test suite produced
"Optimization failed: TimeoutError" entries in ~/.headroom/logs/proxy.log
because TestClient instantiations from test_proxy_anthropic_compression_diagnostics.py
inherited the module-level handler.

Add a regression test that imports the server module in a subprocess
and asserts no RotatingFileHandler is attached to the headroom logger.

Refs #303 (review thread).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
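The subprocess-based regression test described in that commit could look roughly like this (json stands in for headroom.proxy.server so the sketch runs anywhere; the real test would import the server module and assert on the headroom logger):

```python
import subprocess
import sys

# Importing in a fresh interpreter isolates module-level side effects
# from any logging state already present in the current test process.
SNIPPET = """
import logging, logging.handlers
import json  # stand-in for: import headroom.proxy.server
handlers = logging.getLogger("headroom").handlers
assert not any(
    isinstance(h, logging.handlers.RotatingFileHandler) for h in handlers
), "a file handler was attached at import time"
"""

def import_is_side_effect_free() -> bool:
    proc = subprocess.run(
        [sys.executable, "-c", SNIPPET], capture_output=True
    )
    return proc.returncode == 0

print(import_is_side_effect_free())
```

Running the import in a subprocess is what makes the check trustworthy: a handler leaked by an earlier test in the same process can neither mask nor cause a failure.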