
diag(proxy): plumb request_id and exception type into compression failure logs#303

Merged
chopratejas merged 3 commits into chopratejas:main from SwiftWing21:diag/compression-timeout-296
Apr 29, 2026

Conversation

@SwiftWing21
Contributor

Summary

Refs #296. Observability-only change to make the next bug report on compression-stage timeouts actually root-causable.

The reporter's evidence in #296 includes "Pipeline complete: ... saved 6100 tokens" appearing right before "Optimization failed:" — leading to the natural reading that the pipeline succeeded but its result was discarded. Under concurrent load that reading isn't necessarily true: the pipeline log lines carry no request_id, so they could be from a sibling request. We can't tell from the log alone.

This PR closes that gap without speculating on the root cause:

  1. pipeline.apply accepts request_id via kwargs and prefixes its two INFO log lines ("Pipeline: freezing first N/M ..." and "Pipeline complete: ...") with [request_id] when present. Pre-existing log sites that don't pass it stay unchanged (empty prefix).
  2. The four anthropic_pipeline.apply() call sites in the Anthropic handler now pass request_id through (three in handle_anthropic_messages, one in handle_anthropic_batch_create).
  3. The catch-all warning at anthropic.py:880 now includes the exception type and request_id: [{request_id}] Optimization failed: {type(e).__name__}: {e}. This targets the specific gap in #296 (Anthropic token-mode compression result discarded after pipeline completes; proxy times out and forwards original request): its evidence shows Optimization failed: with nothing after the colon, because str(asyncio.TimeoutError()) is empty.
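The empty-message behavior behind change (3) is easy to confirm in isolation. A minimal sketch of the old versus new warning text (format_failure is a hypothetical stand-in for the handler's f-string, not the actual code):

```python
import asyncio

def format_failure(request_id: str, exc: Exception) -> str:
    # Old format was f"Optimization failed: {exc}" -- empty for TimeoutError.
    # New format adds the [request_id] prefix and the exception type name.
    return f"[{request_id}] Optimization failed: {type(exc).__name__}: {exc}"

old = f"Optimization failed: {asyncio.TimeoutError()}"
new = format_failure("hr_1777396740_000001", asyncio.TimeoutError())

print(old)  # "Optimization failed: " -- nothing after the colon
print(new)  # "[hr_1777396740_000001] Optimization failed: TimeoutError: "
```

With the type name in the message, a report of this failure is immediately distinguishable from, say, a ValueError raised inside the pipeline.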

Validation in the wild

While testing locally I noticed my live ~/.headroom/logs/proxy.log already contained entries in exactly the new format:

[hr_1777396740_000001] Optimization failed: TimeoutError:

Turned out these were test-pollution artifacts from a separate finding (see follow-up comment) — but they confirm the new warning format works as intended end-to-end and would have made the original bug report immediately identifiable as a TimeoutError rather than a silent failure.

Tests

New tests/test_proxy_anthropic_compression_diagnostics.py adds two focused tests:

  • test_request_id_plumbed_to_pipeline_apply — verifies the handler plumbs request_id into pipeline.apply() via the lightweight _make_proxy_client() pattern + monkeypatched apply.
  • test_optimization_failure_logs_exception_type — verifies the warning includes the exception type when pipeline.apply raises asyncio.TimeoutError (whose str() is empty). Uses a dedicated _ListHandler because headroom.proxy has propagate=False, so caplog can't see the records via root propagation.
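The propagate=False wrinkle is reproducible with plain logging. A self-contained sketch of the handler-attachment pattern (the logger name comes from the description above; ListHandler is a hypothetical minimal version of the test helper):

```python
import logging

class ListHandler(logging.Handler):
    """Collect emitted records in a list so a test can assert on them."""
    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)

logger = logging.getLogger("headroom.proxy")
logger.propagate = False  # records never reach the root logger, so caplog misses them

handler = ListHandler()
logger.addHandler(handler)  # attach directly to the non-propagating logger
logger.warning("[hr_x_000001] Optimization failed: TimeoutError: ")
logger.removeHandler(handler)

print(handler.records[0].getMessage())
```

pytest's caplog works by attaching to the root logger, which is exactly the path propagate=False cuts off, hence the dedicated handler.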

Test plan

  • ruff check on modified files — clean
  • ruff format --check on modified files — clean
  • Full Anthropic handler suite passes — 41/41 across cache stability (17), diagnostics (2), stage timings (3), pre-upstream backpressure (20). No regressions.
  • CI: cargo fmt / clippy / Rust workspace tests (not run locally — Python-only change)

Scope

This is intentionally diagnostics-only. The underlying timeout still needs reproduction at 367k+ token transcripts. Acting on my informed guess (thread-pool starvation under anthropic_pre_upstream_concurrency=8 with several large concurrent compressions, since asyncio.to_thread uses the default executor and the work is GIL-heavy) would mean shipping a speculative fix without a repro. With these diagnostics shipped, the next bug report will tell us whether the failing request actually reached the pipeline — and that's the question root-causing #296 needs answered first.
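For background on that guess (not evidence for it): asyncio.to_thread funnels work into the event loop's shared default ThreadPoolExecutor, so several GIL-heavy compressions contend for both the pool and the interpreter lock. A runnable sketch, where cpu_heavy is a hypothetical stand-in for one compression pass:

```python
import asyncio
import os
import time

# Default executor size on recent CPython is roughly min(32, cpu_count + 4).
default_workers = min(32, (os.cpu_count() or 1) + 4)

def cpu_heavy(n: int) -> int:
    # Pure-Python loop: holds the GIL, unlike I/O-bound thread work.
    total = 0
    for i in range(n):
        total += i * i
    return total

async def main() -> None:
    # Eight "concurrent" compressions all share one default executor
    # and, being pure Python, largely serialize on the GIL anyway.
    start = time.perf_counter()
    results = await asyncio.gather(
        *(asyncio.to_thread(cpu_heavy, 200_000) for _ in range(8))
    )
    elapsed = time.perf_counter() - start
    print(f"{len(results)} tasks, default pool={default_workers}, {elapsed:.2f}s")

asyncio.run(main())
```

If the real compression work behaves like this, eight concurrent large requests would queue behind each other until one crosses the 30s deadline, which is consistent with (but does not prove) the #296 symptom.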

🤖 Generated with Claude Code

@SwiftWing21
Contributor Author

Side finding worth surfacing — test runs pollute the live proxy log

While verifying these diagnostics against my live ~/.headroom/logs/proxy.log, I found that every test run that imports headroom.proxy.server writes to the production log file. The mechanism:

headroom/proxy/helpers.py:189 calls _setup_file_logging() unconditionally at module import. That installs a RotatingFileHandler on the headroom root logger pointing at ~/.headroom/logs/proxy.log, regardless of whether we're inside pytest or running the real proxy.

Concretely: the diagnostics test in this PR (test_optimization_failure_logs_exception_type) deliberately raises TimeoutError to verify the new warning format. When run, it logs:

[hr_<timestamp>_000001] Optimization failed: TimeoutError:

…to the user's real proxy log. With the per-process counter starting at _000001 for each fresh _make_proxy_client() instance, these synthetic failures look like real first-request-of-session failures — and they were initially misleading me into thinking I was reproducing #296 in the wild.

A few possible fixes, in order of how scope-clean they are:

  1. Detect pytest at import — if "PYTEST_CURRENT_TEST" not in os.environ: around the _setup_file_logging() call. One-line, zero behavior change for the proxy.
  2. Move _setup_file_logging() into the create_app() / FastAPI lifespan path so it only runs when the app is actually being served.
  3. Make file logging opt-in via an env var (HEADROOM_FILE_LOGGING=1) or a ProxyConfig field.

I'd lean (1) for minimal scope, but (2) is the more "correct" fix architecturally.
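A hedged sketch of what option (2) could look like (the handler parameters and create_app wiring are assumptions for illustration, not the project's actual code):

```python
import logging
import logging.handlers

_FILE_LOGGING_CONFIGURED = False

def _setup_file_logging(log_path: str) -> None:
    """Attach a RotatingFileHandler to the headroom logger exactly once."""
    global _FILE_LOGGING_CONFIGURED
    if _FILE_LOGGING_CONFIGURED:
        return
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=10_000_000, backupCount=3  # sizes are placeholders
    )
    logging.getLogger("headroom").addHandler(handler)
    _FILE_LOGGING_CONFIGURED = True

def create_app(log_path: str):
    # File logging now happens only when the app is actually constructed,
    # so `import headroom.proxy.server` in a test stays side-effect free.
    _setup_file_logging(log_path)
    app = object()  # placeholder for building and returning the real app
    return app
```

The once-guard keeps repeated create_app() calls (e.g. multiple TestClient instances in one process) from stacking duplicate handlers.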

Question for the maintainers before I open a sibling PR: is this worth fixing, and if so which of the three approaches do you prefer? Happy to roll it into this PR if you'd like both shipped together, or keep them separate for reviewability.

🤖 Generated with Claude Code

@chopratejas
Owner

Thank you for fixing this - I agree that (2) is much better if you can push it in a follow-up PR

@chopratejas
Owner

Hey @SwiftWing21 - can you help fix the test failures?

SwiftWing21 and others added 2 commits April 28, 2026 20:59
…ilure logs

Issue chopratejas#296 reports compression appearing to complete successfully then
being discarded after a 30s timeout. The diagnostic gap that makes the
report hard to root-cause:

1. The pipeline's "Pipeline complete: ..." and "Pipeline: freezing
   first ..." log lines have no request_id, so under concurrent load
   the reporter cannot tell whether the success log is from the
   timing-out request or a sibling request.
2. The handler's catch-all "Optimization failed: {e}" passes only
   str(e), which is empty for asyncio.TimeoutError — the report shows
   "Optimization failed:" with nothing after the colon.

This change is observability-only:

- pipeline.apply now reads request_id from kwargs and prefixes its two
  INFO log lines with [request_id] when present.
- The four anthropic_pipeline.apply call sites in the Anthropic handler
  (3 in handle_anthropic_messages, 1 in handle_anthropic_batch_create)
  pass request_id through.
- The catch-all warning becomes
  "[{request_id}] Optimization failed: {type(e).__name__}: {e}" so
  TimeoutError is distinguishable from real exceptions in bug reports.

No behavior change. Adds two tests covering both diagnostics.

This is intentionally not a fix for chopratejas#296 — the underlying timeout still
needs reproduction at 367k+ token transcripts. With these diagnostics in
place, the next bug report will be able to confirm whether the failing
request actually reached the pipeline.

Refs chopratejas#296

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test

The previous _ListHandler approach attached a handler to the
headroom.proxy logger and worked locally on Python 3.14, but failed in
CI on Python 3.10-3.13 — the warning record never reached the handler.
Root cause is unclear (possibly cross-test logger state), but the
handler-attachment path is brittle for a single-warning assertion.

Replace it with unittest.mock.patch.object on the handler module's
logger.warning. This is invariant to logging hierarchy, propagation
flags, and per-test logger mutations — we directly observe the call
that the production code makes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@SwiftWing21 SwiftWing21 force-pushed the diag/compression-timeout-296 branch from 5cc8aa1 to 036e424 Compare April 29, 2026 04:00
@codecov

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@chopratejas chopratejas merged commit 1812695 into chopratejas:main Apr 29, 2026
18 checks passed
chopratejas pushed a commit that referenced this pull request Apr 29, 2026
Move _setup_file_logging() from module import to create_app() so
importing headroom.proxy.server in tests or library contexts no longer
silently attaches a RotatingFileHandler to the user's live proxy.log.

This was discovered via PR #303: running the test suite produced
"Optimization failed: TimeoutError" entries in ~/.headroom/logs/proxy.log
because TestClient instantiations from test_proxy_anthropic_compression_diagnostics.py
inherited the module-level handler.

Add a regression test that imports the server module in a subprocess
and asserts no RotatingFileHandler is attached to the headroom logger.

Refs #303 (review thread).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
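The subprocess-based regression test described in that commit could look roughly like this (json stands in for headroom.proxy.server so the sketch runs anywhere; the real test would import the server module and assert on the headroom logger):

```python
import subprocess
import sys

# Importing in a fresh interpreter isolates module-level side effects
# from any logging state already present in the current test process.
SNIPPET = """
import logging, logging.handlers
import json  # stand-in for: import headroom.proxy.server
handlers = logging.getLogger("headroom").handlers
assert not any(
    isinstance(h, logging.handlers.RotatingFileHandler) for h in handlers
), "a file handler was attached at import time"
"""

def import_is_side_effect_free() -> bool:
    proc = subprocess.run(
        [sys.executable, "-c", SNIPPET], capture_output=True
    )
    return proc.returncode == 0

print(import_is_side_effect_free())
```

Running the import in a subprocess is what makes the check trustworthy: a handler leaked by an earlier test in the same process can neither mask nor cause a failure.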