feat(rust): SmartCrusher PR4 — lossless-first default + CCR-Dropped restoration#287
Merged
chopratejas merged 3 commits intomainfrom Apr 28, 2026
Merged
Conversation
…estoration
Stage 3c.2 PR4. Restores Python's CCR-Dropped semantics on the lossy
path (the cornerstone reversibility guarantee that the port had
silently dropped) and flips the OSS default to lossless-first with a
configurable savings threshold.
# The user-visible behavior
Default `SmartCrusher::new()` now runs:
1. Try lossless compaction.
2. If savings >= `lossless_min_savings_ratio` (default 0.30), ship
it — `compacted` populated, `ccr_hash = None`, nothing dropped.
3. Otherwise fall through to the lossy path — drop rows AND
populate `ccr_hash` so the runtime can cache the full original
for tool-call retrieval.
**No data is ever lost.** "Lossy" means "compressed view inline; full
payload retrievable via CCR cache" — same semantics as Python's
SmartCrusher with CCR enabled. The runtime (PyO3 bridge / proxy
server) owns the cache; this crate computes the hash and emits a
marker so the prompt knows where to look.
# What changed
- `SmartCrusherConfig.lossless_min_savings_ratio: f64` (default 0.30).
Single configurable knob — Enterprise overrides as needed. Below
the threshold, lossless declines and lossy + CCR runs.
- `SmartCrusher::new(cfg)` flips to include the compaction stage by
default. `SmartCrusher::without_compaction(cfg)` is the explicit
opt-out for callers / fixtures that depend on pre-PR4 behavior.
- `crush_array` rewritten:
- Lossless-first dispatch with savings-ratio gate
- Lossy path now hashes the full original (12-char SHA-256 prefix)
and emits a CCR-Dropped marker in `dropped_summary` whenever
rows are dropped
- `ccr_hash` field populated whenever rows were dropped
- `process_value` substitutes the compacted string into the JSON
tree when lossless wins, so `crush()` output reflects the win
- PyO3 bridge: `SmartCrusher.without_compaction()` static method;
`SmartCrusherConfig` exposes the new `lossless_min_savings_ratio`
field; Python `SmartCrusher` wrapper accepts `with_compaction=True`
(default) and routes to the right Rust constructor.
- Parity harness: legacy 17 fixtures use `without_compaction()` so
byte-equal coverage of the lossy path is preserved.
# Tests
- Rust: 281/281 smart_crusher unit tests pass (was 277). Six new
tests cover: lossless wins above threshold, lossy falls through
below threshold, CCR hash deterministic + input-dependent, lossy
without compaction emits CCR, passthrough paths don't emit CCR,
without_compaction yields no compacted field.
- Python parity: 21/21 (legacy fixtures via without_compaction).
- Python lossless default smoke: 3/3 new tests in
test_smart_crusher_lossless_default.py.
- Python retention: 21/21 (updated to opt into the lossy path
explicitly since their semantics target row-level retention).
- make ci-precheck green.
Modules:
crates/headroom-core/src/transforms/smart_crusher/{config,crusher}.rs
crates/headroom-parity/src/lib.rs
crates/headroom-py/src/lib.rs
headroom/transforms/smart_crusher.py
tests/test_quality_retention.py
tests/test_transforms/test_smart_crusher_{lossless_default,rust_parity}.py
PR4 flipped the OSS default to lossless-first. The MCP server and LangChain eval tests assert wire-format and row-level retention properties that belong to the lossy path; the lossless path substitutes a CSV+schema STRING in place of arrays, which is great for LLM prompts but wire-incompatible with consumers that iterate the JSON. Pin both call sites to the lossy + CCR-Dropped path via `with_compaction=False`. Same retention semantics as Python's pre-PR4 SmartCrusher behavior — full payload still cached via CCR for tool retrieval; nothing is lost. Modules: - headroom/integrations/mcp/server.py — runtime MCP wrapper - tests/test_integrations/langchain/test_evals.py — eval fixture CI run that surfaced these: actions/runs/25025161868
Two more tests broke after PR4's lossless-first default flip — same root cause as the langchain/MCP fixes (#287's first patch): - tests/test_proxy_ccr.py — TestEndToEndTOINIntegration asserts CCR-cache state after compression. Lossless wins on the test fixture and skips CCR entirely (nothing dropped). Pin to lossy via with_compaction=False so the cache assertion holds. - tests/test_text_compressors.py — TestSmartCrusherTextIntegration asserts JSON-array shape round-trip. Lossless substitutes a CSV+schema string. Pin to lossy + JSON shape via with_compaction=False. Lossless coverage exists separately in test_smart_crusher_lossless_default.py. Same pattern, same fix. CI run that surfaced these: actions/runs/25025876328
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two things ship here:
Restores CCR-Dropped on the lossy path. Python's SmartCrusher always cached the full original payload via CCR when rows were dropped — the cornerstone reversibility guarantee. The Rust port had silently dropped that. PR4 brings it back: `ccr_hash` is populated whenever rows are dropped, and `dropped_summary` carries a `<<ccr:HASH N_rows_offloaded>>` marker so the LLM knows there's retrievable data.
Flips the OSS default to lossless-first with a configurable threshold. `SmartCrusher::new()` now tries lossless compaction; if savings ≥ `lossless_min_savings_ratio` (default `0.30`), it ships the compacted form. Otherwise it falls through to the lossy path — which now properly emits CCR markers.
Lossy ≠ data loss
This is the key reframing. In headroom, "lossy" means "compressed view inline; full payload retrievable via CCR cache." Same semantics as Python's CCR-Dropped:
This PR makes that contract explicit again.
The dispatch logic
```rust
crush_array(items):
ccr_hash = sha256(items)[:12] // for lossy path
if compaction enabled:
rendered = compact(items)
if rendered.is_compacted:
savings = 1 - len(rendered) / len(input)
if savings >= threshold: // default 0.30
return lossless(rendered) // nothing dropped, no CCR
// fall through to lossy
kept = lossy_planner(items)
return lossy(kept, ccr_hash, dropped_marker) // CCR-Dropped restored
```
Configurable threshold (no hardcoding)
`SmartCrusherConfig.lossless_min_savings_ratio: f64` defaults to `0.30`. Override at construction time. Set to `0.0` to always prefer lossless when available; set to `1.0` to effectively disable the lossless path.
This is the OSS-grade expression of the boundary. Enterprise customers needing dynamic decisions (sensitivity tags, flow position, model capability) get a fuller `CompactionPolicy` trait when they ask — but for OSS, one knob is enough.
What changed
Tests
Six new Rust tests cover the dispatch:
What this PR is NOT
Next up