
chore(rust): port ContentDetector to Rust + parity + PyO3 bridge #295

Merged
chopratejas merged 1 commit into main from
rust-stage-3d-pr1-content-detector
Apr 28, 2026


@chopratejas
Owner

Summary

PR1 of the ContentRouter migration — ports headroom/transforms/content_detector.py to headroom-core with byte-equal parity, adds the parity comparator, and exposes detect_content_type + is_json_array_of_dicts to Python via PyO3 so PR2 (ContentRouter scaffold) can dispatch into Rust in-process.

  • Faithful regex port (no ML — Magika lives one level up in ContentRouter, by design). Same dispatch order, confidence formulas, and line-count caps as Python.
  • 21 recorded fixtures cover every dispatch branch (JSON arrays, diff, HTML, search, build/log, six languages, plain text/empty/whitespace fallbacks). Parity: 21/21 matched.
  • Two parity bugs found and fixed during port:
    1. The second TypeScript pattern was unanchored in Rust: the regex crate's is_match searches anywhere in the line, whereas Python's pattern.match(line) only matches at the start. The Rust pattern is now anchored explicitly so that is_match reproduces Python's narrower semantics.
    2. The max() tie-break in code detection must follow Python dict insertion order, not registration order. Rust's Iterator::max_by returns the last element on ties, so the port now tracks scores and uses find(score == max) to get first-on-tie behavior.
  • Universal f64 fix in the parity harness: serde_json (without arbitrary_precision) parses fixture floats lossily but emits f64s at full precision via the json! macro. The harness now round-trips the actual value through to_string + from_str so both sides go through the same parser. Future comparators with f64 outputs benefit automatically.
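
The tie-break difference described in item 2 can be illustrated with a minimal, standard-library-only sketch. The function names (`first_max`, `last_max`), the integer scores, and the language labels are hypothetical and chosen for illustration; they are not taken from content_detector.rs, whose actual score type and structure may differ.

```rust
/// Returns the first entry with the highest score, mimicking Python's
/// `max(scores, key=scores.get)` over an insertion-ordered dict.
/// (Hypothetical helper; not the function used in the actual port.)
fn first_max<'a>(scores: &[(&'a str, u32)]) -> Option<&'a str> {
    // Find the maximum score first, then take the first entry that reaches it.
    let max = scores.iter().map(|&(_, s)| s).max()?;
    scores.iter().find(|&&(_, s)| s == max).map(|&(name, _)| name)
}

/// Returns the last entry with the highest score, which is what
/// `Iterator::max_by_key` (and `max_by`) yield when several elements tie.
fn last_max<'a>(scores: &[(&'a str, u32)]) -> Option<&'a str> {
    scores.iter().max_by_key(|&&(_, s)| s).map(|&(name, _)| name)
}

fn main() {
    // Two languages tie on score; the slice order stands in for dict insertion order.
    let scores = [("python", 3), ("javascript", 3), ("go", 1)];

    assert_eq!(first_max(&scores), Some("python"));
    assert_eq!(last_max(&scores), Some("javascript"));

    println!("first-on-tie: {:?}", first_max(&scores));
    println!("last-on-tie:  {:?}", last_max(&scores));
}
```

With tied scores the two strategies pick different winners, which is the kind of divergence a parity fixture would surface.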

PyO3 bridge:

  • headroom._core.detect_content_type(content) -> DetectionResult
  • headroom._core.is_json_array_of_dicts(content) -> bool
  • DetectionResult exposes .content_type (string tag), .confidence (f64), .metadata (dict). GIL released during scan.

Test plan

  • 21 unit tests in content_detector.rs (all branches + edge cases)
  • Parity: cargo run -p headroom-parity --bin parity-run -- run --only content_detector → 21/21 matched
  • PyO3 bridge smoke-tested (json_array, source_code, plain_text via Python)
  • make ci-precheck PASSED (cargo fmt + clippy + workspace tests + 185 Python tests + commitlint)

Up next (per the 4-PR plan)

  • PR2: ContentRouter scaffold + non-Magika dispatch paths (uses this bridge)
  • PR3: Magika via the Rust magika crate
  • PR4: Compressor dispatch wiring

Faithful port of `headroom/transforms/content_detector.py` into
`headroom-core`. Same regex patterns, dispatch order, confidence
formulas, and line-count caps; lockstep with Python via 21 recorded
parity fixtures (every dispatch branch exercised).

- crates/headroom-core/src/transforms/content_detector.rs: regex-only
  detector (no ML); ContentType/DetectionResult mirror Python's enum +
  dataclass surface; metadata uses serde_json::Map for clean PyO3
  bridging.
- Tie-break in code detection: track scores in first-match insertion
  order (matches Python dict iteration semantics on `max()` ties).
- TypeScript second pattern is start-anchored — Python's
  `pattern.match(line)` is start-anchored, but the regex crate's
  `is_match` is unanchored, so the literal `:` prefix is required for
  parity.
- crates/headroom-parity: ContentDetectorComparator + universal f64
  normalization in `compare_fixture` (serde_json's lossy parse vs
  full-precision serialize creates a 1-ULP asymmetry that broke
  comparison; round-trip the actual through to_string/from_str).
- crates/headroom-py: detect_content_type/is_json_array_of_dicts and
  PyDetectionResult exposed via PyO3 with GIL released during scan.
- tests/parity/recorder.py: new `_wrap_function` for free-function
  recording; content_detector hook + 21 varied inputs covering JSON
  arrays, diffs, HTML, search, build/log, six languages, and fallbacks.

Sets up PR2 (ContentRouter scaffold) to call into Rust ContentDetector
in-process via the bridge.
@chopratejas chopratejas merged commit 035fa02 into main Apr 28, 2026
25 checks passed
