
perf(rust): tier-1 multi-worker wins — GIL release, sharded CCR store, single-serialize CCR write#293

Merged
chopratejas merged 1 commit into main from rust-perf-tier1-gil-dashmap-single-serialize
Apr 28, 2026

Conversation

@chopratejas (Owner)

Summary

Audit-driven perf PR. Three independent hot-path fixes targeting concurrent-request throughput. Each is bench-measured below; the proxy hot path benefits from all three at once.

Stacked on top of #292. Audit notes for the full perf landscape (Tier 1/2/3) are in the commit body — this PR ships only Tier 1 (highest impact, surgical changes).

The three changes

1. PyO3 GIL release on heavy Rust compute

SmartCrusher.crush, smart_crush_content, crush_array_json, compact_document_json, and DiffCompressor.compress/compress_with_stats used to hold the GIL across the entire Rust call. A 100 ms compress() blocked every other Python thread for 100 ms — multi-worker uvicorn deployments serialized through SmartCrusher.

Fix: wrap each compute call in py.allow_threads(|| ...). &str inputs from Python are copied to an owned String first, because PyO3 ties those borrows to the GIL being held.

Verified end-to-end with 4 Python threads × 20 crushes each: the single-thread baseline takes 829 ms per 20-crush batch, and 4 threads in parallel finish in 826 ms wall clock. That is 4× the work in the same time, a 4.01× effective speedup and essentially perfect parallel scaling.
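The shape of the change can be sketched with stdlib-only stand-ins (the real code uses pyo3's `py.allow_threads`; `allow_threads` and `crush_inner` below are illustrative, not the crate's actual API):

```rust
use std::thread;

// Stand-in for pyo3's Python::allow_threads: in real pyo3 this releases
// the GIL, runs the closure, then re-acquires the GIL before returning.
fn allow_threads<R: Send>(f: impl FnOnce() -> R + Send) -> R {
    f()
}

// Heavy pure-Rust compute that must not run while holding the GIL.
fn crush_inner(input: &str) -> usize {
    input.bytes().filter(|b| !b.is_ascii_whitespace()).count()
}

fn crush(py_input: &str) -> usize {
    // Copy the GIL-tied &str to an owned String BEFORE releasing the GIL:
    // the borrow is only valid while the GIL is held.
    let owned: String = py_input.to_string();
    allow_threads(move || crush_inner(&owned))
}

fn main() {
    // With the GIL released, threads doing heavy compute run in parallel.
    let handles: Vec<_> = (0..4).map(|_| thread::spawn(|| crush("ab cd"))).collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 4);
    }
    println!("ok");
}
```

The owned-`String` copy is the one subtlety: moving the closure off the GIL means it cannot capture any borrow whose lifetime is tied to holding the GIL.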

2. CcrStore: Mutex<HashMap> → DashMap-backed sharded store

Single Mutex was the dominant bottleneck under multi-worker load. Replace with DashMap (sharded concurrent map) + a small Mutex<VecDeque> for FIFO eviction order only.

A/B bench in crates/headroom-core/benches/ccr_store.rs (200 mixed put/get × N threads):

| Threads | DashMap | Legacy Mutex | Speedup |
|--------:|--------:|-------------:|--------:|
| 1 | 63 µs | 71 µs | 1.1× |
| 2 | 98 µs | 194 µs | 2.0× |
| 4 | 178 µs | 707 µs | 4.0× |
| 8 | 342 µs | 1267 µs | 3.7× |

Legacy degrades ~linearly with thread count (one Mutex serializes everything); DashMap stays near-flat per-thread.
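The idea behind the sharded map can be shown with a minimal stdlib-only sketch: N independent Mutex<HashMap> shards selected by key hash, so threads touching different shards never contend. (DashMap does this internally with finer-grained locking; `ShardedMap` below is illustrative, not its API.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};

struct ShardedMap {
    shards: Vec<Mutex<HashMap<String, String>>>,
}

impl ShardedMap {
    fn new(n: usize) -> Self {
        Self { shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    // Pick a shard by key hash: distinct keys usually hit distinct locks.
    fn shard(&self, key: &str) -> &Mutex<HashMap<String, String>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn put(&self, key: &str, val: &str) {
        self.shard(key).lock().unwrap().insert(key.into(), val.into());
    }

    fn get(&self, key: &str) -> Option<String> {
        self.shard(key).lock().unwrap().get(key).cloned()
    }
}

fn main() {
    // Mirror the bench shape: 8 threads × 200 puts/gets, every key readable.
    let m = Arc::new(ShardedMap::new(16));
    let handles: Vec<_> = (0..8)
        .map(|t| {
            let m = Arc::clone(&m);
            std::thread::spawn(move || {
                for i in 0..200 {
                    let key = format!("ccr:{t}:{i}");
                    m.put(&key, "payload");
                    assert_eq!(m.get(&key).as_deref(), Some("payload"));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("ok");
}
```

With one global Mutex, every one of the 1600 operations above would serialize through the same lock; with 16 shards, contention only occurs when two threads happen to hash to the same shard.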

3. Single-serialize the lossy CCR payload

The lossy crush_array path was serializing the full array twice — once for the hash, once for the store write — each one allocating Value::Array(items.to_vec()) and deep-cloning the entire tree. For a 50-item dict array that's MBs of allocator pressure per call.

Refactor: canonical_array_json serializes &[Value] directly (same bytes, no wrapper allocation, no tree clone). It is called once, and both the hash and the store write reuse the same bytes. The hash format is stable: all 17 parity fixtures still match byte-for-byte.
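The serialize-once pattern, sketched with stdlib-only stand-ins (the real `canonical_array_json` serializes `&[serde_json::Value]` directly; here pre-serialized element strings stand in for Values, and `DefaultHasher` stands in for the real hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Build the canonical array bytes straight from the slice: no
// Value::Array(items.to_vec()) wrapper, no deep clone of the tree.
fn canonical_array_json(items: &[String]) -> String {
    format!("[{}]", items.join(","))
}

fn hash_bytes(bytes: &str) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn main() {
    let items = vec![r#"{"id":1}"#.to_string(), r#"{"id":2}"#.to_string()];

    // Serialize ONCE...
    let canonical = canonical_array_json(&items);

    // ...then both the hash and the store write reuse the same bytes.
    let key = hash_bytes(&canonical);
    let stored = canonical.clone();

    assert_eq!(canonical, r#"[{"id":1},{"id":2}]"#);
    assert_eq!(hash_bytes(&stored), key); // hash-stable: same bytes, same key
    println!("ok");
}
```

The point is that the hash and the stored payload are derived from one buffer, so the output bytes (and therefore the hash format) are unchanged while the second serialization and the tree clone disappear.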

Test plan

  • cargo test --workspace: 479 lib + 14 integration + remaining tests, all green
  • cargo clippy --workspace -- -D warnings: clean
  • cargo fmt --check: clean
  • make ci-precheck-python: 185 tests pass
  • New ccr.rs concurrent-stress test (8 threads × 200 puts/gets, every key readable)
  • parity-run smart_crusher — 17/17 fixtures match (proves single-serialize is hash-stable)
  • End-to-end Python GIL-release verification (script in commit body)
  • Bench A/B for store (committed at benches/ccr_store.rs for regression visibility)

Dependencies added

  • dashmap v6 (mature, widely used in tokio/linkerd; sharded concurrent HashMap)

What's NOT in this PR

Tier-2 / Tier-3 items deferred (full audit in PR9 commit body):

  • EmbeddingScorer's Mutex<TextEmbedding> pool (need to verify ORT internal threadpool first)
  • Redundant JSON parse in classify_cell + try_parse_json_container
  • Lossless wins still allocate items.to_vec() even when caller consumes .compacted instead
  • simd-json on the proxy hot path (when transforms get wired into the proxy)
  • TCP nodelay / keepalive on reqwest

…, single-serialize CCR write

Three orthogonal hot-path fixes targeting concurrent-request throughput.
Each is independently bench-measured below; the proxy hot path benefits
from all three at once.

== 1. PyO3 GIL release on heavy compute ==

PyO3 methods (crush, smart_crush_content, crush_array_json,
compact_document_json, compress, compress_with_stats) used to hold the
GIL across the entire Rust call. Result: a 100ms compress() blocked
EVERY other Python thread for 100ms — multi-worker uvicorn deployments
serialized through SmartCrusher.

Wrap each compute call in `py.allow_threads(|| ...)`. Inputs (`&str`
from Python) are copied to owned `String` first because PyO3 ties them
to the GIL hold. PyDict construction stays on the GIL side.

Measured: 4 Python threads each running 20 crushes:
  before (GIL held): ~3.3s wall    (serialized — equivalent to 4×0.83s)
  after (allow_threads): 826ms wall (4.01x speedup, perfect parallel)

== 2. CcrStore: Mutex<HashMap> -> DashMap-backed sharded ==

Single Mutex was the dominant bottleneck under multi-worker load — every
put/get serialized through one lock. Replace with DashMap (sharded
concurrent map, lock-free reads within a shard) plus a separate
small Mutex<VecDeque> for FIFO insertion-order eviction. Reads of
distinct keys never contend; writes only contend during the brief
order-queue push or capacity-sweep.
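A stdlib-only sketch of the map-plus-order-queue split (Mutex<HashMap> stands in for DashMap; field and type names are illustrative):

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

struct Store {
    map: Mutex<HashMap<String, String>>, // DashMap in the real code
    order: Mutex<VecDeque<String>>,      // small lock: FIFO insertion order only
    capacity: usize,
}

impl Store {
    fn put(&self, key: &str, val: &str) {
        self.map.lock().unwrap().insert(key.into(), val.into());
        // Brief contention window: push onto the order queue, then sweep
        // oldest entries if over capacity.
        let mut order = self.order.lock().unwrap();
        order.push_back(key.to_string());
        while order.len() > self.capacity {
            if let Some(oldest) = order.pop_front() {
                self.map.lock().unwrap().remove(&oldest);
            }
        }
    }

    fn get(&self, key: &str) -> Option<String> {
        // Reads never touch the order queue.
        self.map.lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let s = Store {
        map: Mutex::new(HashMap::new()),
        order: Mutex::new(VecDeque::new()),
        capacity: 2,
    };
    s.put("a", "1");
    s.put("b", "2");
    s.put("c", "3"); // capacity sweep evicts "a" (oldest)
    assert_eq!(s.get("a"), None);
    assert_eq!(s.get("c").as_deref(), Some("3"));
    println!("ok");
}
```

Keeping eviction order in its own small Mutex means gets stay on the sharded map's fast path; only puts briefly touch the queue.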

A/B bench (200 mixed put/get ops × N threads, in benches/ccr_store.rs):
  Threads | DashMap | Legacy Mutex | Speedup
  --------|---------|--------------|--------
        1 |   63 µs |        71 µs |   1.13x
        2 |   98 µs |       194 µs |   2.0x
        4 |  178 µs |       707 µs |   4.0x
        8 |  342 µs |      1267 µs |   3.7x

Legacy degrades ~linearly with thread count; DashMap stays near-flat
per-thread. Real multi-worker scaling.

== 3. Single-serialize the lossy CCR payload ==

The lossy `crush_array` path used to serialize the full array TWICE:
once in `hash_array_for_ccr` (allocates `Value::Array(items.to_vec())`,
deep-clones every Value subtree, then serializes), and a second time
in the store-write site. For a 50-item dict array that's ~MB of
allocator pressure per crushed array.

Introduce `canonical_array_json` (serializes `&[Value]` directly — same
bytes as `Value::Array(items.to_vec())` but no wrapper allocation +
no tree clone), call it ONCE per lossy path, then both hash and store
from those same bytes. Hash-format stable — all 17 parity fixtures
match byte-for-byte.

== Tests ==

- 8 ccr.rs unit tests including a new concurrent-stress test (8 threads
  × 200 puts/gets, every key readable afterwards)
- 14 ccr_roundtrip integration tests stay green
- parity-run smart_crusher: 17/17 fixtures match
- 479 lib + 14 integration + 185 Python tests all pass
- New benches/ccr_store.rs runs the A/B and is committed for regression
  visibility

== Dependencies added ==

- dashmap v6  (mature, widely-used in tokio/linkerd ecosystem)
@chopratejas chopratejas force-pushed the rust-perf-tier1-gil-dashmap-single-serialize branch from ab92681 to 29aadb1 Compare April 28, 2026 05:27
@chopratejas chopratejas merged commit f81b88b into main Apr 28, 2026
25 checks passed
