diag(proxy): plumb request_id and exception type into compression failure logs (#303)
Conversation
Side finding worth surfacing — test runs pollute the live proxy log

While verifying these diagnostics against my live proxy log, I found that running the test suite writes real entries into it.

Concretely: the diagnostics test in this PR (`tests/test_proxy_anthropic_compression_diagnostics.py`) instantiates TestClient, which imports the server module and thereby attaches its file handler — so test-generated warnings go …to the user's real proxy log. With the per-process counter starting at …

A few possible fixes, in order of how scope-clean they are: …
I'd lean toward (1) for minimal scope, but (2) is the more "correct" fix architecturally. Question for the maintainers before I open a sibling PR: is this worth fixing, and if so, which of the three approaches do you prefer? Happy to roll it into this PR if you'd like both shipped together, or keep them separate for reviewability.

🤖 Generated with Claude Code
Thank you for fixing this - I agree that (2) is much better if you can push it in a follow-up PR
Hey @SwiftWing21 - can you help fix the test failures?
diag(proxy): plumb request_id and exception type into compression failure logs

Issue chopratejas#296 reports compression appearing to complete successfully, then being discarded after a 30s timeout. The diagnostic gap that makes the report hard to root-cause:

1. The pipeline's "Pipeline complete: ..." and "Pipeline: freezing first ..." log lines have no request_id, so under concurrent load the reporter cannot tell whether the success log is from the timing-out request or a sibling request.
2. The handler's catch-all "Optimization failed: {e}" passes only str(e), which is empty for asyncio.TimeoutError — the report shows "Optimization failed:" with nothing after the colon.

This change is observability-only:

- pipeline.apply now reads request_id from kwargs and prefixes its two INFO log lines with [request_id] when present.
- The four anthropic_pipeline.apply call sites in the Anthropic handler (3 in handle_anthropic_messages, 1 in handle_anthropic_batch_create) pass request_id through.
- The catch-all warning becomes "[{request_id}] Optimization failed: {type(e).__name__}: {e}" so TimeoutError is distinguishable from real exceptions in bug reports.

No behavior change. Adds two tests covering both diagnostics.

This is intentionally not a fix for chopratejas#296 — the underlying timeout still needs reproduction at 367k+ token transcripts. With these diagnostics in place, the next bug report will be able to confirm whether the failing request actually reached the pipeline.

Refs chopratejas#296

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
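A minimal sketch of the kwargs-based plumbing this commit describes. The `Pipeline.apply` shape, the frozen-message count, and the token figure are all stand-ins; only the `request_id`-via-kwargs prefixing technique is taken from the commit message:

```python
import logging

logger = logging.getLogger("sketch.proxy.pipeline")


class Pipeline:
    # Illustrative stub; the real headroom pipeline does much more.
    def apply(self, messages, **kwargs):
        # request_id arrives via kwargs; when absent the prefix is empty,
        # so pre-existing call sites log exactly as before.
        request_id = kwargs.get("request_id")
        prefix = f"[{request_id}] " if request_id else ""
        frozen = min(2, len(messages))
        logger.info("%sPipeline: freezing first %d/%d messages",
                    prefix, frozen, len(messages))
        logger.info("%sPipeline complete: saved %d tokens", prefix, 0)
        return messages
```

Under concurrent load, two interleaved requests then produce `[req-1] Pipeline complete: ...` and `[req-2] Pipeline complete: ...` rather than two indistinguishable success lines.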
… test

The previous _ListHandler approach attached a handler to the headroom.proxy logger and worked locally on Python 3.14, but failed in CI on Python 3.10-3.13 — the warning record never reached the handler. Root cause is unclear (possibly cross-test logger state), but the handler-attachment path is brittle for a single-warning assertion.

Replace it with unittest.mock.patch.object on the handler module's logger.warning. This is invariant to logging hierarchy, propagation flags, and per-test logger mutations — we directly observe the call that the production code makes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from 5cc8aa1 to 036e424
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Move _setup_file_logging() from module import to create_app(), so importing headroom.proxy.server in tests or library contexts no longer silently attaches a RotatingFileHandler to the user's live proxy.log.

This was discovered via PR #303: running the test suite produced "Optimization failed: TimeoutError" entries in ~/.headroom/logs/proxy.log, because TestClient instantiations from test_proxy_anthropic_compression_diagnostics.py inherited the module-level handler.

Add a regression test that imports the server module in a subprocess and asserts no RotatingFileHandler is attached to the headroom logger.

Refs #303 (review thread).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
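The before/after shape of this commit can be sketched as follows. `_setup_file_logging` and `create_app` are the names from the commit message; the log path here is a temp-dir stand-in for `~/.headroom/logs` so the sketch has no side effects on a real home directory, and the returned app is a placeholder:

```python
import logging
import logging.handlers
import pathlib
import tempfile

# Stand-in for ~/.headroom/logs.
_LOG_DIR = pathlib.Path(tempfile.mkdtemp())


def _setup_file_logging():
    handler = logging.handlers.RotatingFileHandler(
        _LOG_DIR / "proxy.log", maxBytes=10_000_000, backupCount=3
    )
    logging.getLogger("headroom").addHandler(handler)


# Before the fix, _setup_file_logging() ran right here at import time, so a
# bare `import headroom.proxy.server` attached the handler. After the fix it
# runs only when an app is actually constructed:
def create_app():
    _setup_file_logging()
    return object()  # stand-in for the real app object
```

The regression test then becomes: import the server module in a subprocess and assert the headroom logger has no RotatingFileHandler attached until create_app() is called.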
Summary
Refs #296. Observability-only change to make the next bug report on compression-stage timeouts actually root-causable.
The reporter's evidence in #296 includes "Pipeline complete: ... saved 6100 tokens" appearing right before "Optimization failed:" — leading to the natural reading that the pipeline succeeded but its result was discarded. Under concurrent load that reading isn't necessarily true: the pipeline log lines carry no request_id, so they could be from a sibling request. We can't tell from the log alone.
This PR closes that gap without speculating on the root cause:
- `pipeline.apply` accepts `request_id` via kwargs and prefixes its two INFO log lines ("Pipeline: freezing first N/M ..." and "Pipeline complete: ...") with `[request_id]` when present. Pre-existing log sites that don't pass it stay unchanged (empty prefix).
- The `anthropic_pipeline.apply()` call sites in the Anthropic handler now pass `request_id` through (three in `handle_anthropic_messages`, one in `handle_anthropic_batch_create`).
- The catch-all at `anthropic.py:880` now includes the exception type and request_id: `[{request_id}] Optimization failed: {type(e).__name__}: {e}`. This targets the specific symptom in "Anthropic token-mode compression result discarded after pipeline completes; proxy times out and forwards original request": #296's evidence shows `Optimization failed:` with nothing after the colon, because `str(asyncio.TimeoutError())` is empty.

Validation in the wild
While testing locally, I noticed my live `~/.headroom/logs/proxy.log` already contained entries in exactly the new format. Turned out these were test-pollution artifacts from a separate finding (see follow-up comment) — but they confirm the new warning format works as intended end-to-end, and would have made the original bug report immediately identifiable as a TimeoutError rather than a silent failure.
Tests
New `tests/test_proxy_anthropic_compression_diagnostics.py` adds two focused tests:

- `test_request_id_plumbed_to_pipeline_apply` — verifies the handler plumbs `request_id` into `pipeline.apply()`, via the lightweight `_make_proxy_client()` pattern plus a monkeypatched `apply`.
- `test_optimization_failure_logs_exception_type` — verifies the warning includes the exception type when `pipeline.apply` raises `asyncio.TimeoutError` (whose `str()` is empty). Uses a dedicated `_ListHandler` because `headroom.proxy` has `propagate=False`, so caplog can't see the records via root propagation.

Test plan
- `ruff check` on modified files — clean
- `ruff format --check` on modified files — clean

Scope
This is intentionally diagnostics-only. The underlying timeout still needs reproduction at 367k+ token transcripts. My informed guess (thread-pool starvation under `anthropic_pre_upstream_concurrency=8` with several large concurrent compressions, since `asyncio.to_thread` uses the default executor and the work is GIL-heavy) would be a speculative fix without a repro. With these diagnostics shipped, the next bug report will tell us whether the failing request actually reached the pipeline — and that's the question root-causing #296 needs answered first.

🤖 Generated with Claude Code
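For context on the starvation guess in the scope note, here is an illustrative sketch (not headroom code; `compress` and the pool sizes are invented). `asyncio.to_thread` funnels all work through the event loop's default ThreadPoolExecutor, which is shared with every other `to_thread` caller, so large GIL-heavy jobs can queue behind each other; one speculative mitigation is a dedicated, explicitly sized executor:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def compress(n):
    # Stand-in for GIL-heavy compression work.
    return sum(i * i for i in range(n))


async def with_default_executor():
    # All eight jobs share the loop's default ThreadPoolExecutor with any
    # other to_thread user in the process.
    return await asyncio.gather(
        *(asyncio.to_thread(compress, 50_000) for _ in range(8))
    )


async def with_dedicated_executor():
    # Speculative mitigation: give compression its own sized pool so it
    # cannot starve (or be starved by) other to_thread work.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [loop.run_in_executor(pool, compress, 50_000)
                   for _ in range(8)]
        return await asyncio.gather(*futures)
```

Whether this matches the actual failure mode is exactly what the new diagnostics are meant to establish before any fix is attempted.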