Skip to content

feat(rust): SmartCrusher PR4 — lossless-first default + CCR-Dropped restoration#287

Merged
chopratejas merged 3 commits intomainfrom
rust-stage-3c-2-pr4-lossless-first-default
Apr 28, 2026
Merged

feat(rust): SmartCrusher PR4 — lossless-first default + CCR-Dropped restoration#287
chopratejas merged 3 commits intomainfrom
rust-stage-3c-2-pr4-lossless-first-default

Conversation

@chopratejas
Copy link
Copy Markdown
Owner

Summary

Two things ship here:

  1. Restores CCR-Dropped on the lossy path. Python's SmartCrusher always cached the full original payload via CCR when rows were dropped — the cornerstone reversibility guarantee. The Rust port had silently dropped that. PR4 brings it back: `ccr_hash` is populated whenever rows are dropped, and `dropped_summary` carries a `<<ccr:HASH N_rows_offloaded>>` marker so the LLM knows there's retrievable data.

  2. Flips the OSS default to lossless-first with a configurable threshold. `SmartCrusher::new()` now tries lossless compaction; if savings ≥ `lossless_min_savings_ratio` (default `0.30`), it ships the compacted form. Otherwise it falls through to the lossy path — which now properly emits CCR markers.

Lossy ≠ data loss

This is the key reframing. In headroom, "lossy" means "compressed view inline; full payload retrievable via CCR cache." Same semantics as Python's CCR-Dropped:

  • The runtime (PyO3 bridge / proxy server) caches the full original keyed by `ccr_hash`
  • The prompt sees the compressed inline output + a CCR marker
  • When the LLM needs more, it tool-calls the retrieval API
  • Nothing is ever actually lost

This PR makes that contract explicit again.

The dispatch logic

```rust
crush_array(items):
ccr_hash = sha256(items)[:12] // for lossy path

if compaction enabled:
rendered = compact(items)
if rendered.is_compacted:
savings = 1 - len(rendered) / len(input)
if savings >= threshold: // default 0.30
return lossless(rendered) // nothing dropped, no CCR

// fall through to lossy
kept = lossy_planner(items)
return lossy(kept, ccr_hash, dropped_marker) // CCR-Dropped restored
```

Configurable threshold (no hardcoding)

`SmartCrusherConfig.lossless_min_savings_ratio: f64` defaults to `0.30`. Override at construction time. Set to `0.0` to always prefer lossless when available; set to `1.0` to effectively disable the lossless path.

This is the OSS-grade expression of the boundary. Enterprise customers needing dynamic decisions (sensitivity tags, flow position, model capability) get a fuller `CompactionPolicy` trait when they ask — but for OSS, one knob is enough.

What changed

  • `SmartCrusherConfig`: new `lossless_min_savings_ratio` field (default `0.30`).
  • `SmartCrusher::new()`: includes the compaction stage by default.
  • `SmartCrusher::without_compaction()`: explicit opt-out for legacy parity fixtures and callers who don't want lossless attempts.
  • `crush_array`: rewritten with lossless-first dispatch + CCR-Dropped on lossy.
  • `process_value`: substitutes the compacted string into the JSON tree when lossless wins (so `crush()` output reflects the win at the public API layer).
  • PyO3 bridge: `SmartCrusher.without_compaction()` static method; `Python SmartCrusher` wrapper accepts `with_compaction=True` (default).
  • Parity harness: legacy fixtures use `without_compaction()` to preserve byte-equal coverage of the lossy path.

Tests

Suite Result
Rust `headroom-core` lib (smart_crusher) 281/281 (was 277 — +4 net new)
Python parity (legacy 17 fixtures) 21/21
Python lossless default smoke 3/3 (new)
Python retention 21/21 (updated to opt-in to lossy path)
`make ci-precheck` ✅ green

Six new Rust tests cover the dispatch:

  • `lossless_wins_when_savings_above_threshold`
  • `lossy_falls_through_when_savings_below_threshold`
  • `ccr_hash_is_deterministic`
  • `ccr_hash_changes_with_input`
  • `lossy_without_compaction_still_emits_ccr_hash`
  • `passthrough_paths_do_not_emit_ccr_hash`

What this PR is NOT

  • Not a budget enforcer. PR3b will add the document-level token budget + selective lossy escalation when even lossless+lossy can't fit. Out of scope here.
  • Not a `CompactionPolicy` trait. Enterprise dynamic decision (signals → strategy) lands when an Enterprise customer asks. Today's threshold knob is the OSS expression.
  • Not a re-introduction of PR3a's `DocumentCompactor` walker. That content was lost in the PR2 squash-merge and needs its own follow-up PR.

Next up

  • PR3a-redux: re-land the `DocumentCompactor` walker (was on the lost PR3a branch). Walks any JSON document and applies lossless compaction at every compactable spot.
  • PR3b: document-level token budget — escalate biggest blocks to lossy + CCR when total still exceeds budget.
  • PR5: A/B eval harness comparing format quality on real fixtures.

…estoration

Stage 3c.2 PR4. Restores Python's CCR-Dropped semantics on the lossy
path (the cornerstone reversibility guarantee that the port had
silently dropped) and flips the OSS default to lossless-first with a
configurable savings threshold.

# The user-visible behavior

Default `SmartCrusher::new()` now runs:

  1. Try lossless compaction.
  2. If savings >= `lossless_min_savings_ratio` (default 0.30), ship
     it — `compacted` populated, `ccr_hash = None`, nothing dropped.
  3. Otherwise fall through to the lossy path — drop rows AND
     populate `ccr_hash` so the runtime can cache the full original
     for tool-call retrieval.

**No data is ever lost.** "Lossy" means "compressed view inline; full
payload retrievable via CCR cache" — same semantics as Python's
SmartCrusher with CCR enabled. The runtime (PyO3 bridge / proxy
server) owns the cache; this crate computes the hash and emits a
marker so the prompt knows where to look.

# What changed

- `SmartCrusherConfig.lossless_min_savings_ratio: f64` (default 0.30).
  Single configurable knob — Enterprise overrides as needed. Below
  the threshold, lossless declines and lossy + CCR runs.

- `SmartCrusher::new(cfg)` flips to include the compaction stage by
  default. `SmartCrusher::without_compaction(cfg)` is the explicit
  opt-out for callers / fixtures that depend on pre-PR4 behavior.

- `crush_array` rewritten:
  - Lossless-first dispatch with savings-ratio gate
  - Lossy path now hashes the full original (12-char SHA-256 prefix)
    and emits a CCR-Dropped marker in `dropped_summary` whenever
    rows are dropped
  - `ccr_hash` field populated whenever rows were dropped
  - `process_value` substitutes the compacted string into the JSON
    tree when lossless wins, so `crush()` output reflects the win

- PyO3 bridge: `SmartCrusher.without_compaction()` static method;
  `SmartCrusherConfig` exposes the new `lossless_min_savings_ratio`
  field; Python `SmartCrusher` wrapper accepts `with_compaction=True`
  (default) and routes to the right Rust constructor.

- Parity harness: legacy 17 fixtures use `without_compaction()` so
  byte-equal coverage of the lossy path is preserved.

# Tests

- Rust: 281/281 smart_crusher unit tests pass (was 277). Six new
  tests cover: lossless wins above threshold, lossy falls through
  below threshold, CCR hash deterministic + input-dependent, lossy
  without compaction emits CCR, passthrough paths don't emit CCR,
  without_compaction yields no compacted field.
- Python parity: 21/21 (legacy fixtures via without_compaction).
- Python lossless default smoke: 3/3 new tests in
  test_smart_crusher_lossless_default.py.
- Python retention: 21/21 (updated to opt into the lossy path
  explicitly since their semantics target row-level retention).
- make ci-precheck green.

Modules:
  crates/headroom-core/src/transforms/smart_crusher/{config,crusher}.rs
  crates/headroom-parity/src/lib.rs
  crates/headroom-py/src/lib.rs
  headroom/transforms/smart_crusher.py
  tests/test_quality_retention.py
  tests/test_transforms/test_smart_crusher_{lossless_default,rust_parity}.py
PR4 flipped the OSS default to lossless-first. The MCP server and
LangChain eval tests assert wire-format and row-level retention
properties that belong to the lossy path; the lossless path
substitutes a CSV+schema STRING in place of arrays, which is great
for LLM prompts but wire-incompatible with consumers that iterate
the JSON.

Pin both call sites to the lossy + CCR-Dropped path via
`with_compaction=False`. Same retention semantics as Python's
pre-PR4 SmartCrusher behavior — full payload still cached via CCR
for tool retrieval; nothing is lost.

Modules:
- headroom/integrations/mcp/server.py — runtime MCP wrapper
- tests/test_integrations/langchain/test_evals.py — eval fixture

CI run that surfaced these: actions/runs/25025161868
Two more tests broke after PR4's lossless-first default flip — same
root cause as the langchain/MCP fixes (#287's first patch):

- tests/test_proxy_ccr.py — TestEndToEndTOINIntegration asserts
  CCR-cache state after compression. Lossless wins on the test
  fixture and skips CCR entirely (nothing dropped). Pin to lossy
  via with_compaction=False so the cache assertion holds.

- tests/test_text_compressors.py — TestSmartCrusherTextIntegration
  asserts JSON-array shape round-trip. Lossless substitutes a
  CSV+schema string. Pin to lossy + JSON shape via
  with_compaction=False. Lossless coverage exists separately in
  test_smart_crusher_lossless_default.py.

Same pattern, same fix. CI run that surfaced these:
actions/runs/25025876328
@chopratejas chopratejas merged commit 8e09878 into main Apr 28, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant