Skip to content

fix(smart_crusher): re-attach TOIN learning loop + audit known regressions#300

Merged
chopratejas merged 2 commits intomainfrom
rust-stage-3d-audit-toin-ccr
Apr 28, 2026
Merged

fix(smart_crusher): re-attach TOIN learning loop + audit known regressions#300
chopratejas merged 2 commits intomainfrom
rust-stage-3d-audit-toin-ccr

Conversation

@chopratejas
Copy link
Copy Markdown
Owner

Summary

Audit fix for the silent regressions introduced when SmartCrusher's Python implementation was retired in Stage 3c.1b. Three subsystems shipped disconnected; this PR fixes one outright, labels the other two visibly, and adds a RUST_DEV.md section so future audits don't have to rediscover them.

This is not a Rust change — pure Python shim work, lands directly on main.

What was broken

Subsystem Before After
TOIN learning for SmartCrusher Silently disconnected. ContentRouter skipped TOIN recording for SmartCrusher on the assumption SmartCrusher recorded its own (the retired Python class did). The Rust port doesn't know about TOIN. Net: the highest-traffic compression strategy stopped fueling the learning loop. Re-attached. Shim's `crush()` and `_smart_crush_content()` call `toin.record_compression()` after real compressions (filter on `strategy != "passthrough"` to ignore JSON re-canonicalization).
`ccr_config.inject_retrieval_marker=False` Silently ignored. Rust emits `<<ccr:HASH ...>>` markers in `dropped_summary` unconditionally. Same as before, but now logs a WARNING when callers pass `False` so the mismatch is visible. Fix needs a Rust-side gate; tracked in RUST_DEV.md.
Custom `relevance_config` / `scorer` overrides Logged at debug level then dropped. Bumped to WARNING. Same root cause as above; needs Rust-side support.

Tests

7 regression tests in `tests/test_smart_crusher_toin_attachment.py`:

  • `crush()` records on real compression, skips passthrough
  • structurally-similar inputs land on the same pattern (TOIN aggregation works)
  • `_smart_crush_content()` records (legacy `apply()` path)
  • TOIN errors are non-fatal (compression still completes)
  • non-JSON input doesn't record
  • `inject_retrieval_marker=False` emits the WARNING

Test plan

  • 7 new tests pass
  • No regressions in `test_smart_crusher_*` (41 tests across SmartCrusher / parity / CCR / TOIN integration)
  • `make ci-precheck` PASSED — 185 python tests + rust workspace + commitlint

Why this lands separately (not stacked on the in-flight PRs)

This is a bug fix for already-shipped retirements (`Stage 3c.1b`, merged April 27). The fix is independent of #295 (ContentDetector) and #299 (ContentRouter types) which are forward-looking ports. Splitting keeps the audit visible and reviewable on its own.

…sions

The Stage 3c.1b retirement of the Python SmartCrusher silently disconnected
three subsystems. The audit on 2026-04-28 caught them; this commit fixes
what's fixable today and labels the rest visibly.

## TOIN learning loop — fixed

Before: `ContentRouter._record_to_toin` skipped SmartCrusher on the
assumption SmartCrusher recorded its own TOIN events. The retired Python
class did. The Rust port doesn't know about TOIN. Net result: the
highest-traffic compression strategy stopped fueling the learning loop,
silently.

Fix: shim's `crush()` and `_smart_crush_content()` now call
`toin.record_compression()` after a real compression. Filtered on
`strategy != "passthrough"` because the Rust port flips
`was_modified=True` from JSON whitespace re-canonicalization. Best
effort: TOIN failures are logged at debug level and never break
compression.

Token estimates use `len(json) // 4` (the rule the retired Python used)
because the router doesn't pass a tokenizer down to this layer and
re-tokenizing here would dominate the recording cost.

7 tests in `tests/test_smart_crusher_toin_attachment.py`:
- crush() records on real compression, doesn't on passthrough
- structurally-similar inputs land on the same pattern
- _smart_crush_content() records (legacy apply() path)
- TOIN errors don't break compression
- non-JSON input doesn't record
- inject_retrieval_marker=False emits a WARNING

## CCR marker emission knob — labelled, not yet fixed

`ccr_config.inject_retrieval_marker=False` is not honored — the Rust
port emits `<<ccr:HASH N_rows_offloaded>>` markers in `dropped_summary`
unconditionally. Today the production default has the flag True so no
one is hitting the gap, but the silent-disconnect was a real
regression. Shim now logs a WARNING when callers pass `False` so the
mismatch is visible. Fix needs a Rust-side gate; tracked in
`RUST_DEV.md`.

## Custom relevance scorer — labelled, not yet fixed

`relevance_config` and `scorer` constructor args are accepted for
source compatibility but the Rust default `HybridScorer` always
runs. Shim was logging this at debug; bumped to WARNING. Tracked.

## RUST_DEV.md

New "Known regressions in retired-Python components" section with a
table per retired component. The intent is that this section gets
updated whenever a regression closes (or a new one is found), so the
duplicate-codebase tax stays visible instead of decaying into folklore.
…tion

`caplog` flakes under the full suite — earlier tests can leave
`logging.disable()` or root-handler state that suppresses records on
the named logger before pytest's caplog filter sees them. The test
passed in isolation but failed deterministically once the larger
suite preceded it (CI run 25070639932 across all four Python
versions, ~4500 tests in).

Fix: monkeypatch `headroom.transforms.smart_crusher.logger.warning`
directly. Bypasses every level/disable/handler concern; we only
assert the constructor *called* `logger.warning` with the expected
flag name in the message — which is exactly the contract this test
guards.
chopratejas added a commit that referenced this pull request Apr 28, 2026
Closes the second-of-three gaps the audit (#300) labelled. The Rust
port now has a real on/off switch for CCR row-drop marker emission;
the Python shim flips it from `ccr_config.inject_retrieval_marker`.

- New `enable_ccr_marker: bool` (default `true`) on
  `SmartCrusherConfig`. Default preserves existing behavior — every
  recorded parity fixture still matches byte-for-byte because the gate
  was added at the lossy path's marker emission site, not the
  decision tree.
- `crusher.rs::crush_array` checks the flag before emitting the
  `<<ccr:HASH N_rows_offloaded>>` marker AND before writing the
  original to the CCR store. With the flag off: rows still drop
  (compression still requested), no marker text, no store write,
  `ccr_hash` is `None`.
- PyO3 bridge: new kwarg + getter on `PySmartCrusherConfig`.
- Python shim: passes `enable_ccr_marker=self._ccr_config.
  inject_retrieval_marker` into the Rust config. Removes the audit's
  WARNING (the flag is now honored).
- Parity comparator: tolerates the new field, defaults to `true` for
  fixtures that don't carry it.

Gates ONLY the lossy row-drop sentinel path. Stage-3c.2 opaque-string
CCR substitutions (`<<ccr:HASH,KIND,SIZE>>` from the document walker)
still emit always — they have no Python equivalent, no production
caller has asked for them to be suppressed, and gating them would
silently disable compression of large strings (more confusing than
no marker at all). RUST_DEV.md captures this.

- 2 new Rust unit tests (`enable_ccr_marker_false_suppresses_marker_and_store`,
  `enable_ccr_marker_true_is_default_behavior`) lock the gate's
  semantics + the default.
- The Python warning-asserter test from #300 is flipped to a
  behavior-asserter (`test_ccr_inject_marker_false_suppresses_markers_in_output`)
  + a positive counterpart that asserts markers DO appear when the
  flag is True.
- `make ci-precheck` PASSED — 185 python tests + rust workspace + commitlint.

Stacked on rust-stage-3d-audit-toin-ccr (#300) so the warning the
audit added is removed in the same PR that makes it obsolete.
@chopratejas chopratejas merged commit 33825c3 into main Apr 28, 2026
22 checks passed
chopratejas added a commit that referenced this pull request Apr 28, 2026
…N file

The new SmartCrusher.apply() tests added in this PR feed the global
TOIN learning store (default path `~/.headroom/toin.json`). On Python
3.11, CI runs the suite twice (regular + coverage); a pattern written
in the first pass changes which rows the lossy sampler keeps in the
second pass and breaks
`tests/test_integrations/langchain/test_evals.py::TestRelevancePreservation::test_first_last_items_always_preserved`.

Fix: an `isolated_toin` fixture that points `HEADROOM_TOIN_PATH` at a
tempdir for the test and resets the singleton on enter and exit. Same
isolation pattern PR #300 already uses for `test_smart_crusher_toin_attachment.py`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant