Skip to content

feat(rust): SmartCrusher PR2 — lossless-first tabular compaction#285

Merged
chopratejas merged 2 commits intomainfrom
rust-stage-3c-2-pr2-tabular-compactor
Apr 27, 2026
Merged

feat(rust): SmartCrusher PR2 — lossless-first tabular compaction#285
chopratejas merged 2 commits intomainfrom
rust-stage-3c-2-pr2-tabular-compactor

Conversation

@chopratejas
Copy link
Copy Markdown
Owner

Summary

Stage 3c.2 PR2. Adds an opt-in lossless-first compaction stage to SmartCrusher. When configured, it tries to losslessly re-shape arrays of objects into a recursive IR and renders that to bytes via a pluggable Formatter trait. When NOT configured (default OSS), behavior is byte-equal with the pre-PR2 path — all 17 SmartCrusher parity fixtures stay green.

This is the second of five PRs in Stage 3c.2. Builds on #284 (extension surface — Constraint, Observer, Scorer).

What lands

Five new modules under crates/headroom-core/src/transforms/smart_crusher/compaction/:

  • ir.rs — Recursive Compaction tree. `Table` / `Buckets` / `OpaqueRef` / `Untouched`. `CellValue::Nested` can hold another `Compaction` so multi-level cases share one tree shape.

  • `classifier.rs` — Per-cell decision: `Scalar` / `JsonObject` / `JsonArray` / `StringifiedJson(parsed)` / `Opaque(kind)`. Conservative: in doubt, return `Scalar`.

  • `compactor.rs` — Array → IR. Uniform-nested flattening into dotted columns (`meta.region`, `meta.tier`), stringified-JSON parsing + recursion, opaque-blob CCR-substitution (12-char SHA-256 prefix), heterogeneous bucketing by discriminator, sparse-table fallthrough.

  • `formatter.rs` — `Formatter` trait + two impls:

    • `JsonFormatter`: structured JSON for debug / programmatic use
    • `CsvSchemaFormatter`: `[N]{col:type,col:type}` declaration + CSV rows. Steals TOON's row-count-and-shape idea without adopting TOON's bespoke escaping. >30% smaller than raw JSON on tabular fixtures.
  • `mod.rs` — Re-exports + `CompactionStage` (composed config + formatter pair).

Wiring in `crusher.rs` + `builder.rs`:

  • `SmartCrusher` gains `compaction: Option`
  • Builder: `with_compaction(stage)` and `with_default_compaction()`
  • `CrushArrayResult` gains two new fields (`compacted`, `compaction_kind`) populated only when the stage runs

Why this design

  • Three-trait extension surface preserved. PR1 added Constraint / Observer / Scorer; PR2 adds Formatter as the fourth pluggable seam. Enterprise plug-ins land cleanly.
  • Empty default builder rule held. `SmartCrusherBuilder::new()` still produces a no-compaction crusher. `with_default_compaction()` is the explicit OSS preset. No silent fallbacks.
  • Recursive IR was the unlock. A flat table-of-scalars IR would have collapsed the moment a cell held nested JSON. `CellValue::Nested` makes stringified-JSON parsing, heterogeneous bucketing, and opaque substitution share one renderer pass.
  • CSV+schema, not TOON. Discussed extensively before implementing — TOON saves more tokens but pays a comprehension reliability tax (LLMs are weakest at parsing the format least represented in their training data). CSV with explicit schema gets ~95% of TOON's win and stays in territory LLMs know cold. `ToonFormatter` is a 50-line follow-up once the eval harness lands.

Tests

  • 60 new unit tests across IR / classifier / compactor / formatter / wiring
  • 448/448 `headroom-core` lib tests pass (was 388 — +60 new)
  • 17/17 SmartCrusher parity fixtures byte-equal — default-config path unchanged
  • 21/21 Python parity tests pass via PyO3 bridge
  • `make ci-precheck` green: ruff, mypy, cargo fmt/clippy/test (1.95.0), commitlint
  • CI green on PR (now includes ci(docker): fix Argument list too long when signing bake outputs #283 docker hotfix)

Deferred to follow-up PRs

  • `ToonFormatter` (ships after eval harness)
  • Diff/code routing in cells → `DiffCompressor` / `CodeCompressor` (coupled to ContentRouter Phase 4)
  • Budget-aware row dropping (Constraint-respecting) when rendered size exceeds budget
  • Format A/B eval harness with model-quality scoring
  • ContentRouter unification (Phase 4)

…ilder

Stage 3c.2 PR1 — the public extension surface that lets Enterprise
crates plug richer components into SmartCrusher without forking. Three
traits, one builder, behavior-equivalent on every parity fixture.

The three traits:

- Scorer (re-exported from `crate::relevance::RelevanceScorer`).
  Already a trait; OSS HybridScorer (BM25 + fastembed). Enterprise
  point: per-tenant Loop-trained scorer.

- Constraint (new in `traits.rs`). `must_keep(items, item_strings)
  -> Vec<usize>` — indices the allocator must keep regardless of
  saliency. OSS defaults: `KeepErrorsConstraint`,
  `KeepStructuralOutliersConstraint` — thin wrappers around the
  existing `detect_error_items_for_preservation` and
  `detect_structural_outliers` functions. Enterprise point:
  BusinessRuleConstraint, RegulatoryConstraint::HIPAA, and so on.

- Observer (new in `traits.rs`). `on_event(&CrushEvent)` fires once
  per top-level `crush()` call with strategy + sizes + elapsed_ns.
  OSS default: TracingObserver — writes to the `tracing` crate at
  debug, zero-cost when filtered out. Enterprise point:
  AuditObserver, MetricsObserver, LoopTrainingObserver.

The builder (`builder.rs`):

`SmartCrusherBuilder::new(config)` starts EMPTY (no scorer, no
constraints, no observers — explicit composition; "no silent
fallbacks" applied to the API surface). Methods stack:
with_scorer, add_constraint, add_default_oss_constraints (appends
KeepErrors + KeepStructuralOutliers), add_observer,
with_default_oss_setup (HybridScorer + default constraints +
TracingObserver in one call).

`SmartCrusher::new(config)` is preserved as the OSS default factory
(equivalent to `SmartCrusher::builder(config).with_default_oss_setup
.build()`). Every existing caller (proxy, content_router,
integrations, evals) continues to work unchanged.

Internal refactor:

`SmartCrusherPlanner` now holds `&[Box<dyn Constraint>]` and
iterates the configured constraints via a new
`apply_constraints(items, item_strings, keep)` method. Replaces four
hardcoded `detect_structural_outliers` +
`detect_error_items_for_preservation` call sites in the four plan
methods. With the OSS default constraint stack the must-keep set is
byte-identical to pre-PR1 — verified by all 17 parity fixtures.

`SmartCrusher` gained two fields: `constraints: Vec<Box<dyn
Constraint>>` and `observers: Vec<Box<dyn Observer>>`. New
`from_parts` constructor (#[doc(hidden)]) is the builder's exit
point.

What did NOT change in this PR:

- The internal planning algorithm (lossless tabular, saliency
  scoring, structured markers — those are PR 2/3/4).
- The string/number/object/mixed-array crusher paths in
  `crushers.rs` and the `prioritize_indices` helper in
  `orchestration.rs` — they still call the detection functions
  directly. Path B from the design doc: dict-array path is the
  primary value plugin point; lifting the leaf compressors can come
  later if customers ask.

Tests:

15 new tests across `traits.rs`, `constraints.rs`, `observer.rs`,
`builder.rs`. Coverage: each constraint trait method called and
pinned (errors flagged, structural outliers detected, item_strings
cache parity, empty-array safety); builder empty-build path,
default-OSS-stack append, add_constraint order preservation,
with_default_oss_setup yields expected counts, observer fires
end-to-end on a real crush; TracingObserver name stable, on_event
doesn't panic.

Verification:
- cargo test --workspace: 403 passed (was 388, +15 new), 0 failed.
- parity: 17/17 byte-equal for smart_crusher.
- make ci-precheck: green.

Stage 3c.2 PR sequence:
- PR 1 (this commit): three traits + builder.
- PR 2 (next): improvement A — TabularCompactor.
- PR 3: improvement B — saliency scoring + structured allocator.
- PR 4: improvement C — structured marker formatter.
- PR 5: ENT-A — `headroom-enterprise` scaffold.
Stage 3c.2 PR2. Adds an opt-in compaction stage that runs BEFORE the
existing lossy pipeline. When configured, it tries to losslessly
re-shape arrays of objects into a recursive Compaction IR and renders
that to bytes via a pluggable Formatter trait. When not configured
(default OSS), behavior is byte-equal with the pre-PR2 path — all 17
SmartCrusher parity fixtures stay green.

# What lands

- Recursive Compaction IR (`compaction/ir.rs`): Table / Buckets /
  OpaqueRef / Untouched. CellValue can hold a nested Compaction so
  multi-level cases (stringified-JSON inside cells, heterogeneous
  arrays bucketed by discriminator, opaque blobs CCR-substituted)
  share one tree shape.

- Cell classifier (`compaction/classifier.rs`): per-cell decision —
  Scalar / JsonObject / JsonArray / StringifiedJson(parsed) /
  Opaque(kind). Conservative: in doubt, return Scalar.

- TabularCompactor (`compaction/compactor.rs`): array → IR. Handles
  uniform-nested flattening into dotted columns ("meta.region",
  "meta.tier"), stringified-JSON parsing + recursion, opaque-blob
  CCR-substitution (12-char SHA-256 prefix), and heterogeneous
  bucketing by discriminator. Falls through to a sparse Table when
  no clean discriminator exists, so we always do better than the
  lossy path for object arrays.

- Formatter trait (`compaction/formatter.rs`) + two impls:
  - JsonFormatter: structured JSON for debugging / programmatic use.
  - CsvSchemaFormatter: [N]{col:type,col:type} declaration + CSV
    rows. Steals TOON's row-count-and-shape declaration without
    adopting TOON's bespoke escaping. CSV is the format LLMs are
    strongest at — every model has seen millions of examples in
    training. >30% smaller than raw JSON serialization on tabular
    fixtures.

- Wiring (`crusher.rs`, `builder.rs`): SmartCrusher gains an optional
  compaction stage. Builder methods with_compaction(stage) and
  with_default_compaction() opt in. CrushArrayResult gets two new
  fields (compacted, compaction_kind) populated only when the stage
  runs. strategy_info becomes compaction kind when compaction won.

# Why this design

- Three-trait extension surface preserved. PR1 added Constraint /
  Observer / Scorer; PR2 adds Formatter as the fourth pluggable
  seam. Enterprise plug-ins land cleanly without forking core.

- Empty default builder rule held. SmartCrusherBuilder::new() still
  produces a no-compaction crusher. with_default_compaction() is
  the explicit OSS preset. No silent fallbacks.

- Recursive IR was the unlock. A flat table-of-scalars IR would have
  collapsed the moment a cell held nested JSON. Making
  CellValue::Nested hold another Compaction made stringified-JSON
  parsing + heterogeneous bucketing + opaque substitution all share
  one renderer pass.

- CCR substitution for opaque cells. Strings classified as
  base64/HTML/long-opaque become structured markers keyed by 12-char
  SHA-256 prefix. The full bytes round-trip via the CCR store (PyO3
  bridge owns actual storage; this PR emits the marker and computes
  the hash).

# Tests

- 60 new unit tests across IR / classifier / compactor / formatter /
  wiring (448 total in headroom-core, was 388).
- 17/17 SmartCrusher parity fixtures byte-equal — default-config
  path completely unchanged.
- 21/21 Python parity tests pass via PyO3 bridge.
- make ci-precheck green: ruff, mypy, cargo fmt/clippy/test
  (1.95.0), commitlint.

# Deferred to follow-up PRs

- ToonFormatter (small; ship after eval harness compares formats)
- Diff/code detection in cells → routes to DiffCompressor /
  CodeCompressor (coupled to ContentRouter Phase 4)
- Budget-aware row dropping (Constraint-respecting) when rendered
  size exceeds budget
- Format A/B eval harness
- ContentRouter unification (Phase 4)

Modules: crates/headroom-core/src/transforms/smart_crusher/compaction/*, builder.rs, crusher.rs, mod.rs
@chopratejas chopratejas merged commit 00e25d8 into main Apr 27, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant