feat(rust): SmartCrusher PR2 — lossless-first tabular compaction#285
Merged
chopratejas merged 2 commits intomainfrom Apr 27, 2026
Merged
feat(rust): SmartCrusher PR2 — lossless-first tabular compaction#285chopratejas merged 2 commits intomainfrom
chopratejas merged 2 commits intomainfrom
Conversation
…ilder Stage 3c.2 PR1 — the public extension surface that lets Enterprise crates plug richer components into SmartCrusher without forking. Three traits, one builder, behavior-equivalent on every parity fixture. The three traits: - Scorer (re-exported from `crate::relevance::RelevanceScorer`). Already a trait; OSS HybridScorer (BM25 + fastembed). Enterprise point: per-tenant Loop-trained scorer. - Constraint (new in `traits.rs`). `must_keep(items, item_strings) -> Vec<usize>` — indices the allocator must keep regardless of saliency. OSS defaults: `KeepErrorsConstraint`, `KeepStructuralOutliersConstraint` — thin wrappers around the existing `detect_error_items_for_preservation` and `detect_structural_outliers` functions. Enterprise point: BusinessRuleConstraint, RegulatoryConstraint::HIPAA, and so on. - Observer (new in `traits.rs`). `on_event(&CrushEvent)` fires once per top-level `crush()` call with strategy + sizes + elapsed_ns. OSS default: TracingObserver — writes to the `tracing` crate at debug, zero-cost when filtered out. Enterprise point: AuditObserver, MetricsObserver, LoopTrainingObserver. The builder (`builder.rs`): `SmartCrusherBuilder::new(config)` starts EMPTY (no scorer, no constraints, no observers — explicit composition; "no silent fallbacks" applied to the API surface). Methods stack: with_scorer, add_constraint, add_default_oss_constraints (appends KeepErrors + KeepStructuralOutliers), add_observer, with_default_oss_setup (HybridScorer + default constraints + TracingObserver in one call). `SmartCrusher::new(config)` is preserved as the OSS default factory (equivalent to `SmartCrusher::builder(config).with_default_oss_setup .build()`). Every existing caller (proxy, content_router, integrations, evals) continues to work unchanged. Internal refactor: `SmartCrusherPlanner` now holds `&[Box<dyn Constraint>]` and iterates the configured constraints via a new `apply_constraints(items, item_strings, keep)` method. Replaces four hardcoded `detect_structural_outliers` + `detect_error_items_for_preservation` call sites in the four plan methods. With the OSS default constraint stack the must-keep set is byte-identical to pre-PR1 — verified by all 17 parity fixtures. `SmartCrusher` gained two fields: `constraints: Vec<Box<dyn Constraint>>` and `observers: Vec<Box<dyn Observer>>`. New `from_parts` constructor (#[doc(hidden)]) is the builder's exit point. What did NOT change in this PR: - The internal planning algorithm (lossless tabular, saliency scoring, structured markers — those are PR 2/3/4). - The string/number/object/mixed-array crusher paths in `crushers.rs` and the `prioritize_indices` helper in `orchestration.rs` — they still call the detection functions directly. Path B from the design doc: dict-array path is the primary value plugin point; lifting the leaf compressors can come later if customers ask. Tests: 15 new tests across `traits.rs`, `constraints.rs`, `observer.rs`, `builder.rs`. Coverage: each constraint trait method called and pinned (errors flagged, structural outliers detected, item_strings cache parity, empty-array safety); builder empty-build path, default-OSS-stack append, add_constraint order preservation, with_default_oss_setup yields expected counts, observer fires end-to-end on a real crush; TracingObserver name stable, on_event doesn't panic. Verification: - cargo test --workspace: 403 passed (was 388, +15 new), 0 failed. - parity: 17/17 byte-equal for smart_crusher. - make ci-precheck: green. Stage 3c.2 PR sequence: - PR 1 (this commit): three traits + builder. - PR 2 (next): improvement A — TabularCompactor. - PR 3: improvement B — saliency scoring + structured allocator. - PR 4: improvement C — structured marker formatter. - PR 5: ENT-A — `headroom-enterprise` scaffold.
Stage 3c.2 PR2. Adds an opt-in compaction stage that runs BEFORE the
existing lossy pipeline. When configured, it tries to losslessly
re-shape arrays of objects into a recursive Compaction IR and renders
that to bytes via a pluggable Formatter trait. When not configured
(default OSS), behavior is byte-equal with the pre-PR2 path — all 17
SmartCrusher parity fixtures stay green.
# What lands
- Recursive Compaction IR (`compaction/ir.rs`): Table / Buckets /
OpaqueRef / Untouched. CellValue can hold a nested Compaction so
multi-level cases (stringified-JSON inside cells, heterogeneous
arrays bucketed by discriminator, opaque blobs CCR-substituted)
share one tree shape.
- Cell classifier (`compaction/classifier.rs`): per-cell decision —
Scalar / JsonObject / JsonArray / StringifiedJson(parsed) /
Opaque(kind). Conservative: in doubt, return Scalar.
- TabularCompactor (`compaction/compactor.rs`): array → IR. Handles
uniform-nested flattening into dotted columns ("meta.region",
"meta.tier"), stringified-JSON parsing + recursion, opaque-blob
CCR-substitution (12-char SHA-256 prefix), and heterogeneous
bucketing by discriminator. Falls through to a sparse Table when
no clean discriminator exists, so we always do better than the
lossy path for object arrays.
- Formatter trait (`compaction/formatter.rs`) + two impls:
- JsonFormatter: structured JSON for debugging / programmatic use.
- CsvSchemaFormatter: [N]{col:type,col:type} declaration + CSV
rows. Steals TOON's row-count-and-shape declaration without
adopting TOON's bespoke escaping. CSV is the format LLMs are
strongest at — every model has seen millions of examples in
training. >30% smaller than raw JSON serialization on tabular
fixtures.
- Wiring (`crusher.rs`, `builder.rs`): SmartCrusher gains an optional
compaction stage. Builder methods with_compaction(stage) and
with_default_compaction() opt in. CrushArrayResult gets two new
fields (compacted, compaction_kind) populated only when the stage
runs. strategy_info becomes compaction kind when compaction won.
# Why this design
- Three-trait extension surface preserved. PR1 added Constraint /
Observer / Scorer; PR2 adds Formatter as the fourth pluggable
seam. Enterprise plug-ins land cleanly without forking core.
- Empty default builder rule held. SmartCrusherBuilder::new() still
produces a no-compaction crusher. with_default_compaction() is
the explicit OSS preset. No silent fallbacks.
- Recursive IR was the unlock. A flat table-of-scalars IR would have
collapsed the moment a cell held nested JSON. Making
CellValue::Nested hold another Compaction made stringified-JSON
parsing + heterogeneous bucketing + opaque substitution all share
one renderer pass.
- CCR substitution for opaque cells. Strings classified as
base64/HTML/long-opaque become structured markers keyed by 12-char
SHA-256 prefix. The full bytes round-trip via the CCR store (PyO3
bridge owns actual storage; this PR emits the marker and computes
the hash).
# Tests
- 60 new unit tests across IR / classifier / compactor / formatter /
wiring (448 total in headroom-core, was 388).
- 17/17 SmartCrusher parity fixtures byte-equal — default-config
path completely unchanged.
- 21/21 Python parity tests pass via PyO3 bridge.
- make ci-precheck green: ruff, mypy, cargo fmt/clippy/test
(1.95.0), commitlint.
# Deferred to follow-up PRs
- ToonFormatter (small; ship after eval harness compares formats)
- Diff/code detection in cells → routes to DiffCompressor /
CodeCompressor (coupled to ContentRouter Phase 4)
- Budget-aware row dropping (Constraint-respecting) when rendered
size exceeds budget
- Format A/B eval harness
- ContentRouter unification (Phase 4)
Modules: crates/headroom-core/src/transforms/smart_crusher/compaction/*, builder.rs, crusher.rs, mod.rs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stage 3c.2 PR2. Adds an opt-in lossless-first compaction stage to
SmartCrusher. When configured, it tries to losslessly re-shape arrays of objects into a recursive IR and renders that to bytes via a pluggableFormattertrait. When NOT configured (default OSS), behavior is byte-equal with the pre-PR2 path — all 17 SmartCrusher parity fixtures stay green.This is the second of five PRs in Stage 3c.2. Builds on #284 (extension surface — Constraint, Observer, Scorer).
What lands
Five new modules under
crates/headroom-core/src/transforms/smart_crusher/compaction/:ir.rs— RecursiveCompactiontree. `Table` / `Buckets` / `OpaqueRef` / `Untouched`. `CellValue::Nested` can hold another `Compaction` so multi-level cases share one tree shape.`classifier.rs` — Per-cell decision: `Scalar` / `JsonObject` / `JsonArray` / `StringifiedJson(parsed)` / `Opaque(kind)`. Conservative: in doubt, return `Scalar`.
`compactor.rs` — Array → IR. Uniform-nested flattening into dotted columns (`meta.region`, `meta.tier`), stringified-JSON parsing + recursion, opaque-blob CCR-substitution (12-char SHA-256 prefix), heterogeneous bucketing by discriminator, sparse-table fallthrough.
`formatter.rs` — `Formatter` trait + two impls:
`mod.rs` — Re-exports + `CompactionStage` (composed config + formatter pair).
Wiring in `crusher.rs` + `builder.rs`:
Why this design
Tests
Deferred to follow-up PRs