chore(rust): walker semantics in process_value — stringified-JSON + opaque strings#290

Merged
chopratejas merged 2 commits into main from
rust-stage-3c-2-pr5-walker-integration
Apr 28, 2026

@chopratejas
Owner

Summary

Stage 3c.2 PR5. Closes the gap between the public `crush()` API and the standalone `DocumentCompactor` walker. Augments `SmartCrusher::process_value`'s String branch to mirror the walker's two String cases:

  1. Stringified-JSON containers — strings whose content parses to a JSON object/array → parse, recurse via `process_value`, re-emit.
  2. Opaque blobs — long base64 / HTML / long-text strings → substitute with `<ccr:HASH,KIND,SIZE>` markers (same format as walker.rs + PR4's lossy CCR-Dropped markers).

Type: `chore` so the package version doesn't bump.

Why this matters

After PR4, `SmartCrusher::new()` does lossless-first compaction on top-level arrays. But for tool outputs that wrap a JSON-encoded payload inside a string field (extremely common with MCP, function-call APIs, etc.) or contain opaque blobs (base64-encoded files, HTML chunks), the public API was a no-op. The walker handled these — but only via its own entry point.

PR5 brings walker semantics into the public path. The user's original vision now works for `crush()` callers without any extra setup:

> JSON within JSON, opaque payloads, multi-layered

Implementation (~80 lines + tests)

  • `process_string` method on `SmartCrusher`: dispatches to JSON-recurse / CCR-substitute / passthrough.
  • `try_parse_json_container_str`: cheap parse-only-containers helper.
  • `ccr_marker_for_string` + helpers: marker formatting matching walker.rs byte-for-byte.
  • Special case: when recursion returns `Value::String` (PR4 lossless compaction substituted the array), use the string directly. Avoids double-JSON-encoding the rendered CSV+schema bytes.

Reuses every PR2 primitive — no new traits, no new IR, no new abstractions.
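
The three-way dispatch described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from `crusher.rs`: the `StringAction` enum, the `classify` function, the `OPAQUE_MIN_LEN` threshold, and both heuristics are assumptions made for the example. The real helper is described as parsing the string, which a shape check alone does not do.

```rust
/// What the String branch decides to do with a value (illustrative names).
#[derive(Debug, PartialEq)]
enum StringAction {
    /// Looks like a stringified JSON object/array: parse and recurse.
    RecurseJson,
    /// Long opaque payload: replace with a CCR marker.
    SubstituteMarker,
    /// Anything else is returned unchanged.
    Passthrough,
}

/// Assumed threshold; the real cutoff is not stated in this PR.
const OPAQUE_MIN_LEN: usize = 512;

/// Cheap shape check: only `{...}` or `[...]` can be a JSON container, so
/// bare scalars such as `42` or `"text"` are rejected without a full parse.
fn looks_like_json_container(s: &str) -> bool {
    let t = s.trim();
    (t.starts_with('{') && t.ends_with('}')) || (t.starts_with('[') && t.ends_with(']'))
}

/// Rough base64 test: long enough, and only base64 alphabet characters.
fn looks_like_base64(s: &str) -> bool {
    s.len() >= OPAQUE_MIN_LEN
        && s.bytes()
            .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'='))
}

/// Decide which of the three paths a string value should take.
fn classify(s: &str) -> StringAction {
    if looks_like_json_container(s) {
        // A real implementation would confirm with a JSON parser here and
        // fall through to the other branches if parsing fails.
        StringAction::RecurseJson
    } else if looks_like_base64(s) {
        StringAction::SubstituteMarker
    } else {
        StringAction::Passthrough
    }
}
```

Under these assumptions, `classify("hello")` is a passthrough, `classify("[1,2,3]")` takes the recurse path, and a 600-character base64 run is routed to marker substitution.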

Tests (6 new)

| Test | What it verifies |
|---|---|
| `process_string_short_string_passthrough` | No false positives on plain strings |
| `process_string_stringified_json_array_recurses` | 50-item array inside a string field gets processed |
| `process_string_opaque_blob_becomes_ccr_marker` | Base64 blob → `<ccr:HASH,base64,SIZE>` marker |
| `process_string_top_level_string_processed` | `crush(plain_text)` still unchanged |
| `process_string_does_not_alter_short_quoted_strings` | Short JSON-looking strings stay verbatim (no false-positive opaque) |
| `process_string_helper_parses_only_containers` | Helper rejects bare JSON scalars |

303/303 smart_crusher tests pass (was 297 → +6 PR5). 185/185 Python tests. `make ci-precheck` green.

What's next

  • PR6 / future — `CompactionPolicy` trait when an Enterprise customer asks for dynamic boundary decisions (sensitivity tags, model capability, flow position).
  • A/B eval harness — measure format-quality tradeoffs across CSV+schema / JSON / TOON on real fixtures.
  • MCP / proxy integration — once eval data lands, decide whether to opt those runtime paths into lossless too.

…paque strings

Stage 3c.2 PR5. Closes the gap between the public `crush()` API and
the standalone `DocumentCompactor` walker. Augments
`SmartCrusher::process_value`'s String branch to mirror the walker's
two String cases:

  1. Stringified-JSON containers: parse, recurse via process_value,
     re-emit. The wrapping field stays a string but its contents are
     processed end-to-end. Special-cases the lossless-compaction
     path (when recursion returns Value::String) to avoid double-
     JSON-encoding.

  2. Opaque blobs (long base64 / HTML / long-text strings):
     substitute with `<<ccr:HASH,KIND,SIZE>>` markers — same format
     as walker.rs and PR4's lossy CCR-Dropped markers, so downstream
     consumers can pattern-match regardless of which path emitted.
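
The double-encoding special case in item 1 can be illustrated with a small std-only sketch. The `Value` enum here is a stand-in defined for the example (the real code presumably works on a full JSON value type), and `to_json` and `re_emit` are hypothetical names, not functions from this PR.

```rust
/// Minimal stand-in for a JSON value, defined only for this example.
enum Value {
    String(String),
    Number(i64),
    Array(Vec<Value>),
}

/// Serialize a value as JSON text. Strings are quoted and escaped; only the
/// escapes needed for this demo are handled.
fn to_json(v: &Value) -> String {
    match v {
        Value::String(s) => {
            let mut out = String::from("\"");
            for c in s.chars() {
                match c {
                    '"' => out.push_str("\\\""),
                    '\\' => out.push_str("\\\\"),
                    '\n' => out.push_str("\\n"),
                    other => out.push(other),
                }
            }
            out.push('"');
            out
        }
        Value::Number(n) => n.to_string(),
        Value::Array(items) => {
            let parts: Vec<String> = items.iter().map(to_json).collect();
            format!("[{}]", parts.join(","))
        }
    }
}

/// Turn the processed inner value back into the text stored in the field.
fn re_emit(processed: &Value) -> String {
    match processed {
        // Recursion already produced rendered text: use it as-is, so it is
        // not wrapped in quotes and escaped a second time.
        Value::String(s) => s.clone(),
        // Still a container or scalar: serialize back to JSON text.
        other => to_json(other),
    }
}
```

With this sketch, a rendered result such as `"id,name\n1,a"` comes back unchanged from `re_emit`, whereas passing it through `to_json` would add surrounding quotes and turn the newline into a literal `\n` escape.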

Type: chore, so the package version doesn't bump.

# Why this matters

After PR4, calling `SmartCrusher::new()` and then `crush(json_blob)`
gets lossless-first compaction on top-level arrays. But for tool
outputs that wrap a JSON-encoded payload INSIDE a string field, or
contain opaque blobs (base64-encoded files, HTML chunks), the public
API was a no-op. The walker handled these but only via its own
entry point; users calling `crush()` never hit it.

PR5 brings walker semantics into the public path. Same vision the
user described early in this stage:

> JSON within JSON, opaque payloads, multi-layered

now works for `crush()` callers without any extra setup.

# Implementation (~80 lines)

- `process_string` method on SmartCrusher: dispatches to JSON-recurse
  / CCR-substitute / passthrough.
- `try_parse_json_container_str`: cheap parse-only-containers helper.
- `ccr_marker_for_string` + `opaque_kind_label` + `humanize_bytes`:
  marker formatting matching walker.rs byte-for-byte.

Reuses every PR2 primitive — no new traits, no new IR, no new
abstractions.
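
A rough, std-only sketch of how the three marker helpers could fit together is shown below. The function names come from the list above, but the bodies are assumptions: the hash function, hash length, kind heuristics, and size formatting in walker.rs are not shown in this PR, so the output here will not match the real markers byte-for-byte.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Format a byte count compactly (assumed format: "512B", "2.0KB", "1.5MB").
fn humanize_bytes(n: usize) -> String {
    const KB: f64 = 1024.0;
    let n_f = n as f64;
    if n_f < KB {
        format!("{}B", n)
    } else if n_f < KB * KB {
        format!("{:.1}KB", n_f / KB)
    } else {
        format!("{:.1}MB", n_f / (KB * KB))
    }
}

/// Guess a kind label for an opaque string (heuristics are illustrative).
fn opaque_kind_label(s: &str) -> &'static str {
    let is_b64_char = |b: u8| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'=');
    if s.trim_start().starts_with('<') {
        "html"
    } else if !s.is_empty() && s.bytes().all(is_b64_char) {
        "base64"
    } else {
        "text"
    }
}

/// Build a `<<ccr:HASH,KIND,SIZE>>` marker for an opaque string.
fn ccr_marker_for_string(s: &str) -> String {
    // DefaultHasher is a placeholder: its algorithm is unspecified and may
    // change between Rust releases, so real markers would need a fixed hash.
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    let hash = format!("{:016x}", hasher.finish());
    format!(
        "<<ccr:{},{},{}>>",
        &hash[..12], // assumed truncation length
        opaque_kind_label(s),
        humanize_bytes(s.len())
    )
}
```

Because the marker embeds only a hash, a kind, and a size, any path that emits it can be matched by the same downstream pattern, which is the property the description relies on.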

# Tests (6 new)

- Short string passthrough (no false positives)
- Stringified-JSON array recurses (50 items inside a string field)
- Opaque base64 blob → CCR marker substitution
- Top-level plain text passthrough (crush(plain_text) unchanged)
- Short JSON-looking strings unchanged (no false-positive opaque)
- Helper parses only containers, not bare scalars

303/303 smart_crusher tests pass (was 297 → +6 PR5).
185/185 Python tests. make ci-precheck green.

Module: crates/headroom-core/src/transforms/smart_crusher/crusher.rs
….toml

The action was set to @stable, which installs whatever the latest
stable is (1.95.0 right now). Then maturin invokes cargo, which reads
rust-toolchain.toml and re-resolves to "1.95.0 + clippy + rustfmt".
rustup treats stable and 1.95.0 as distinct toolchain identities and
refuses the second install with:

  failed to install component 'clippy-preview-x86_64-unknown-linux-gnu',
  detected conflict: 'bin/cargo-clippy'

This was intermittent across the matrix (only test (3.10) tripped on
the most recent run; others got lucky on cache state). Pinning the
action ref to 1.95.0 makes both sides ask for the exact same toolchain
identity, so the second install is a no-op and the conflict can't fire.

Bump procedure stays the same: when rust-toolchain.toml's channel
changes, update these refs in lock-step.
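
For reference, a rust-toolchain.toml matching the "1.95.0 + clippy + rustfmt" description above would look roughly like this. It is reconstructed from that description rather than copied from the repository, so treat the exact contents as an assumption:

```toml
[toolchain]
channel = "1.95.0"
components = ["clippy", "rustfmt"]
```

The workflow's toolchain action ref would then carry the same version string as `channel`, which is what keeps the two installs identical.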

Plugin manifests auto-bumped 0.11.0 -> 0.13.2 by sync-plugin-versions
hook (unrelated to the workflow fix).
@chopratejas chopratejas merged commit b376012 into main Apr 28, 2026
25 checks passed