fix(smart_crusher): re-attach TOIN learning loop + audit known regressions by chopratejas · Pull Request #300 · chopratejas/headroom

chopratejas · 2026-04-28T18:29:18Z

Summary

Audit fix for the silent regressions introduced when SmartCrusher's Python implementation was retired in Stage 3c.1b. Three subsystems shipped disconnected; this PR fixes one outright, labels the other two visibly, and adds a RUST_DEV.md section so future audits don't have to rediscover them.

This is not a Rust change — pure Python shim work, lands directly on main.

What was broken

Subsystem	Before	After
TOIN learning for SmartCrusher	Silently disconnected. ContentRouter skipped TOIN recording for SmartCrusher on the assumption SmartCrusher recorded its own (the retired Python class did). The Rust port doesn't know about TOIN. Net: the highest-traffic compression strategy stopped fueling the learning loop.	Re-attached. Shim's `crush()` and `_smart_crush_content()` call `toin.record_compression()` after real compressions (filter on `strategy != "passthrough"` to ignore JSON re-canonicalization).
`ccr_config.inject_retrieval_marker=False`	Silently ignored. Rust emits `<<ccr:HASH ...>>` markers in `dropped_summary` unconditionally.	Same as before, but now logs a WARNING when callers pass `False` so the mismatch is visible. Fix needs a Rust-side gate; tracked in RUST_DEV.md.
Custom `relevance_config` / `scorer` overrides	Logged at debug level then dropped.	Bumped to WARNING. Same root cause as above; needs Rust-side support.

Tests

7 regression tests in `tests/test_smart_crusher_toin_attachment.py`:

`crush()` records on real compression, skips passthrough
structurally-similar inputs land on the same pattern (TOIN aggregation works)
`_smart_crush_content()` records (legacy `apply()` path)
TOIN errors are non-fatal (compression still completes)
non-JSON input doesn't record
`inject_retrieval_marker=False` emits the WARNING

Test plan

7 new tests pass
No regressions in `test_smart_crusher_*` (41 tests across SmartCrusher / parity / CCR / TOIN integration)
`make ci-precheck` PASSED — 185 python tests + rust workspace + commitlint

Why this lands separately (not stacked on the in-flight PRs)

This is a bug fix for already-shipped retirements (`Stage 3c.1b`, merged April 27). The fix is independent of #295 (ContentDetector) and #299 (ContentRouter types) which are forward-looking ports. Splitting keeps the audit visible and reviewable on its own.

…sions The Stage 3c.1b retirement of the Python SmartCrusher silently disconnected three subsystems. The audit on 2026-04-28 caught them; this commit fixes what's fixable today and labels the rest visibly. ## TOIN learning loop — fixed Before: `ContentRouter._record_to_toin` skipped SmartCrusher on the assumption SmartCrusher recorded its own TOIN events. The retired Python class did. The Rust port doesn't know about TOIN. Net result: the highest-traffic compression strategy stopped fueling the learning loop, silently. Fix: shim's `crush()` and `_smart_crush_content()` now call `toin.record_compression()` after a real compression. Filtered on `strategy != "passthrough"` because the Rust port flips `was_modified=True` from JSON whitespace re-canonicalization. Best effort: TOIN failures are logged at debug level and never break compression. Token estimates use `len(json) // 4` (the rule the retired Python used) because the router doesn't pass a tokenizer down to this layer and re-tokenizing here would dominate the recording cost. 7 tests in `tests/test_smart_crusher_toin_attachment.py`: - crush() records on real compression, doesn't on passthrough - structurally-similar inputs land on the same pattern - _smart_crush_content() records (legacy apply() path) - TOIN errors don't break compression - non-JSON input doesn't record - inject_retrieval_marker=False emits a WARNING ## CCR marker emission knob — labelled, not yet fixed `ccr_config.inject_retrieval_marker=False` is not honored — the Rust port emits `<<ccr:HASH N_rows_offloaded>>` markers in `dropped_summary` unconditionally. Today the production default has the flag True so no one is hitting the gap, but the silent-disconnect was a real regression. Shim now logs a WARNING when callers pass `False` so the mismatch is visible. Fix needs a Rust-side gate; tracked in `RUST_DEV.md`. ## Custom relevance scorer — labelled, not yet fixed `relevance_config` and `scorer` constructor args are accepted for source compatibility but the Rust default `HybridScorer` always runs. Shim was logging this at debug; bumped to WARNING. Tracked. ## RUST_DEV.md New "Known regressions in retired-Python components" section with a table per retired component. The intent is that this section gets updated whenever a regression closes (or a new one is found), so the duplicate-codebase tax stays visible instead of decaying into folklore.

…tion `caplog` flakes under the full suite — earlier tests can leave `logging.disable()` or root-handler state that suppresses records on the named logger before pytest's caplog filter sees them. The test passed in isolation but failed deterministically once the larger suite preceded it (CI run 25070639932 across all four Python versions, ~4500 tests in). Fix: monkeypatch `headroom.transforms.smart_crusher.logger.warning` directly. Bypasses every level/disable/handler concern; we only assert the constructor *called* `logger.warning` with the expected flag name in the message — which is exactly the contract this test guards.

Closes the second-of-three gaps the audit (#300) labelled. The Rust port now has a real on/off switch for CCR row-drop marker emission; the Python shim flips it from `ccr_config.inject_retrieval_marker`. - New `enable_ccr_marker: bool` (default `true`) on `SmartCrusherConfig`. Default preserves existing behavior — every recorded parity fixture still matches byte-for-byte because the gate was added at the lossy path's marker emission site, not the decision tree. - `crusher.rs::crush_array` checks the flag before emitting the `<<ccr:HASH N_rows_offloaded>>` marker AND before writing the original to the CCR store. With the flag off: rows still drop (compression still requested), no marker text, no store write, `ccr_hash` is `None`. - PyO3 bridge: new kwarg + getter on `PySmartCrusherConfig`. - Python shim: passes `enable_ccr_marker=self._ccr_config. inject_retrieval_marker` into the Rust config. Removes the audit's WARNING (the flag is now honored). - Parity comparator: tolerates the new field, defaults to `true` for fixtures that don't carry it. Gates ONLY the lossy row-drop sentinel path. Stage-3c.2 opaque-string CCR substitutions (`<<ccr:HASH,KIND,SIZE>>` from the document walker) still emit always — they have no Python equivalent, no production caller has asked for them to be suppressed, and gating them would silently disable compression of large strings (more confusing than no marker at all). RUST_DEV.md captures this. - 2 new Rust unit tests (`enable_ccr_marker_false_suppresses_marker_and_store`, `enable_ccr_marker_true_is_default_behavior`) lock the gate's semantics + the default. - The Python warning-asserter test from #300 is flipped to a behavior-asserter (`test_ccr_inject_marker_false_suppresses_markers_in_output`) + a positive counterpart that asserts markers DO appear when the flag is True. - `make ci-precheck` PASSED — 185 python tests + rust workspace + commitlint. Stacked on rust-stage-3d-audit-toin-ccr (#300) so the warning the audit added is removed in the same PR that makes it obsolete.

…N file The new SmartCrusher.apply() tests added in this PR feed the global TOIN learning store (default path `~/.headroom/toin.json`). On Python 3.11, CI runs the suite twice (regular + coverage); a pattern written in the first pass changes which rows the lossy sampler keeps in the second pass and breaks `tests/test_integrations/langchain/test_evals.py::TestRelevancePreservation::test_first_last_items_always_preserved`. Fix: an `isolated_toin` fixture that points `HEADROOM_TOIN_PATH` at a tempdir for the test and resets the singleton on enter and exit. Same isolation pattern PR #300 already uses for `test_smart_crusher_toin_attachment.py`.

chopratejas mentioned this pull request Apr 28, 2026

fix(smart_crusher): honor inject_retrieval_marker=False end-to-end (Rust gate) #301

Merged

4 tasks

chopratejas merged commit 33825c3 into main Apr 28, 2026
22 checks passed

chopratejas mentioned this pull request Apr 29, 2026

fix(smart_crusher): close audit watch list — CCR.enabled + custom scorer #305

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(smart_crusher): re-attach TOIN learning loop + audit known regressions#300

fix(smart_crusher): re-attach TOIN learning loop + audit known regressions#300
chopratejas merged 2 commits intomainfrom
rust-stage-3d-audit-toin-ccr

chopratejas commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

chopratejas commented Apr 28, 2026

Summary

What was broken

Tests

Test plan

Why this lands separately (not stacked on the in-flight PRs)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant