chore(rust): port ContentDetector to Rust + parity + PyO3 bridge #295
Merged
chopratejas merged 1 commit into main on Apr 28, 2026
Faithful port of `headroom/transforms/content_detector.py` into `headroom-core`. Same regex patterns, dispatch order, confidence formulas, and line-count caps; lockstep with Python via 21 recorded parity fixtures (every dispatch branch exercised).

- crates/headroom-core/src/transforms/content_detector.rs: regex-only detector (no ML); ContentType/DetectionResult mirror Python's enum + dataclass surface; metadata uses serde_json::Map for clean PyO3 bridging.
- Tie-break in code detection: track scores in first-match insertion order (matches Python dict iteration semantics on `max()` ties).
- TypeScript second pattern is start-anchored: Python's `pattern.match(line)` is start-anchored, but the regex crate's `is_match` is unanchored, so the literal `:` prefix is required for parity.
- crates/headroom-parity: ContentDetectorComparator + universal f64 normalization in `compare_fixture` (serde_json's lossy parse vs full-precision serialize creates a 1-ULP asymmetry that broke comparison; round-trip the actual through to_string/from_str).
- crates/headroom-py: detect_content_type/is_json_array_of_dicts and PyDetectionResult exposed via PyO3 with GIL released during scan.
- tests/parity/recorder.py: new `_wrap_function` for free-function recording; content_detector hook + 21 varied inputs covering JSON arrays, diffs, HTML, search, build/log, six languages, and fallbacks.

Sets up PR2 (ContentRouter scaffold) to call into Rust ContentDetector in-process via the bridge.
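To illustrate the tie-break note, here is a minimal Python sketch of the first-on-tie behaviour the port has to reproduce. The language names and scores are made up, and `last_on_tie` / `first_on_tie` are hypothetical helpers written for this example, not functions from `content_detector.py` or the Rust crate.

```python
# Hypothetical scores, inserted in the order each language first matched.
scores = {"python": 2, "javascript": 2, "rust": 1}

# Python's max() returns the first maximal item it encounters, and dicts
# iterate in insertion order, so the earliest-inserted top scorer wins.
python_winner = max(scores, key=scores.get)


def last_on_tie(scores):
    """Emulate a max that keeps the LAST maximal item
    (the behaviour the PR reports for Rust's Iterator::max_by)."""
    best = None
    for name, score in scores.items():
        if best is None or score >= scores[best]:
            best = name
    return best


def first_on_tie(scores):
    """Find the top score, then take the first entry that has it."""
    top = max(scores.values())
    return next(name for name, score in scores.items() if score == top)


print(python_winner)         # python
print(last_on_tie(scores))   # javascript
print(first_on_tie(scores))  # python
```

The two-step "compute the max, then find the first entry equal to it" shape is one way to get first-on-tie semantics from an API that otherwise resolves ties toward the last element; it relies on the scores being kept in insertion order.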
Summary
PR1 of the ContentRouter migration: ports `headroom/transforms/content_detector.py` to `headroom-core` with byte-equal parity, adds the parity comparator, and exposes `detect_content_type` + `is_json_array_of_dicts` to Python via PyO3 so PR2 (ContentRouter scaffold) can dispatch into Rust in-process.

- `regex::is_match` is not start-anchored, but Python's `pattern.match(line)` is. Anchored explicitly so `is_match` matches Python's narrow semantics.
- `max()` tie-break in code detection must follow Python dict insertion order, not registration order. Rust's `Iterator::max_by` returns last on ties; switched to score tracking with `find(score == max)` for first-on-tie behavior.
- `serde_json` (without `arbitrary_precision`) parses fixture floats lossily but emits `f64`s at full precision via the `json!` macro. Round-trip the actual through `to_string` + `from_str` so both sides go through the same parser. Future comparators with `f64` outputs benefit automatically.

PyO3 bridge:

- `headroom._core.detect_content_type(content) -> DetectionResult`
- `headroom._core.is_json_array_of_dicts(content) -> bool`
- `DetectionResult` exposes `.content_type` (string tag), `.confidence` (f64), `.metadata` (dict). GIL released during scan.

Test plan

- `content_detector.rs` (all branches + edge cases)
- `cargo run -p headroom-parity --bin parity-run -- run --only content_detector` → 21/21 matched
- `make ci-precheck` PASSED (cargo fmt + clippy + workspace tests + 185 Python tests + commitlint)

Up next (per the 4-PR plan)

- `magika` crate
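
For reference, the anchoring difference noted in the summary can be reproduced from the Python side alone. This is a sketch under assumptions: the pattern below is a hypothetical stand-in that begins with a literal `:`, not one of the real TypeScript patterns, and `search()` plays the role of an unanchored matcher.

```python
import re

# Hypothetical stand-in for a pattern that begins with a literal ":".
pattern = re.compile(r":\s*(string|number|boolean)\b")

mid_line = "const name: string = 'x'"
at_start = ": string"

# match() only succeeds if the pattern matches at position 0.
assert pattern.match(mid_line) is None
assert pattern.match(at_start) is not None

# search() is unanchored: it finds the pattern anywhere in the line,
# so it accepts a line that match() rejects.
assert pattern.search(mid_line) is not None

# Adding an explicit "^" makes the unanchored search agree with match().
anchored = re.compile(r"^(?:" + pattern.pattern + r")")
assert anchored.search(mid_line) is None
assert anchored.search(at_start) is not None
```

Wrapping the original pattern in a non-capturing group before prefixing `^` keeps the anchor from binding to only the first branch if the pattern ever gains a top-level alternation.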