perf(rust): tier-1 multi-worker wins — GIL release, sharded CCR store, single-serialize CCR write#293
Merged
chopratejas merged 1 commit into main (Apr 28, 2026)
Conversation
Three orthogonal hot-path fixes targeting concurrent-request throughput.
Each is independently bench-measured below; the proxy hot path benefits
from all three at once.
== 1. PyO3 GIL release on heavy compute ==
PyO3 methods (crush, smart_crush_content, crush_array_json,
compact_document_json, compress, compress_with_stats) used to hold the
GIL across the entire Rust call. Result: a 100ms compress() blocked
EVERY other Python thread for 100ms — multi-worker uvicorn deployments
serialized through SmartCrusher.
Wrap each compute call in `py.allow_threads(|| ...)`. Inputs (`&str`
from Python) are copied to owned `String` first because PyO3 ties them
to the GIL hold. PyDict construction stays on the GIL side.
Measured: 4 Python threads each running 20 crushes:
before (GIL held): ~3.3s wall (serialized — equivalent to 4×0.83s)
after (allow_threads): 826ms wall (4.01x speedup, perfect parallel)
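The effect of releasing the GIL can be reproduced with a small stdlib-only harness. This is an illustrative sketch, not the PR's bench code: `zlib.compress` (which releases the GIL during compression in CPython) stands in for the patched `crush`/`compress` methods, and the `bench` helper is hypothetical.

```python
import threading
import time
import zlib

def bench(fn, n_threads, iters):
    """Run fn() `iters` times on each of `n_threads` threads; return wall time."""
    def worker():
        for _ in range(iters):
            fn()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# Stand-in workload: zlib.compress drops the GIL while compressing,
# just as the patched PyO3 methods now do via py.allow_threads.
payload = b"x" * 1_000_000
work = lambda: zlib.compress(payload)

serial = bench(work, n_threads=1, iters=20)
parallel = bench(work, n_threads=4, iters=20)
# With the GIL released, 4 threads complete 4x the work in roughly the
# same wall time as one thread (given >= 4 cores). A GIL-holding native
# call would instead serialize the threads, as SmartCrusher did before.
```

A method that holds the GIL for its whole duration would show no such scaling; the 4-thread run would take close to 4x the single-thread time, which is exactly the before/after gap measured above.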
== 2. CcrStore: Mutex<HashMap> -> DashMap-backed sharded ==
Single Mutex was the dominant bottleneck under multi-worker load — every
put/get serialized through one lock. Replace with DashMap (sharded
concurrent map, lock-free reads within a shard) plus a separate
small Mutex<VecDeque> for FIFO insertion-order eviction. Reads of
distinct keys never contend; writes only contend during the brief
order-queue push or capacity-sweep.
A/B bench (200 mixed put/get ops × N threads, in benches/ccr_store.rs):
Threads | DashMap | Legacy Mutex | Speedup
--------|---------|--------------|--------
   1    |  63 µs  |    71 µs     |  1.13x
   2    |  98 µs  |   194 µs     |  2.0x
   4    | 178 µs  |   707 µs     |  4.0x
   8    | 342 µs  |  1267 µs     |  3.7x
Legacy degrades ~linearly with thread count; DashMap stays near-flat
per-thread. Real multi-worker scaling.
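The shard-per-lock layout plus a single small FIFO queue can be sketched in Python. This is a toy model of the design, not the Rust code: `ShardedStore` is a hypothetical name, DashMap's real sharding is more sophisticated, and the capacity sweep here mirrors the separate `Mutex<VecDeque>` described above.

```python
import threading
from collections import deque

class ShardedStore:
    """Toy model of the CcrStore design: keys hash to one of N shards,
    each with its own lock + dict, so operations on distinct keys rarely
    contend. One small lock guards only the FIFO insertion order."""

    def __init__(self, capacity, n_shards=8):
        self.capacity = capacity
        self.shards = [(threading.Lock(), {}) for _ in range(n_shards)]
        self.order_lock = threading.Lock()
        self.order = deque()  # insertion order, for FIFO eviction

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        lock, data = self._shard(key)
        with lock:
            fresh = key not in data
            data[key] = value
        if fresh:
            # The only cross-shard contention point: a brief queue push
            # and, when over capacity, a FIFO sweep of the oldest keys.
            with self.order_lock:
                self.order.append(key)
                while len(self.order) > self.capacity:
                    victim = self.order.popleft()
                    vlock, vdata = self._shard(victim)
                    with vlock:
                        vdata.pop(victim, None)

    def get(self, key):
        lock, data = self._shard(key)
        with lock:
            return data.get(key)
```

Reads of distinct keys take different shard locks (or, with DashMap, lock-free read paths), which is why the legacy single-Mutex curve grows with thread count while the sharded curve stays near-flat.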
== 3. Single-serialize the lossy CCR payload ==
The lossy `crush_array` path used to serialize the full array TWICE:
once in `hash_array_for_ccr` (allocates `Value::Array(items.to_vec())`,
deep-clones every Value subtree, then serializes), and a second time
in the store-write site. For a 50-item dict array that is megabytes of
allocator pressure per crushed array.
Introduce `canonical_array_json` (serializes `&[Value]` directly — same
bytes as `Value::Array(items.to_vec())` but no wrapper allocation +
no tree clone), call it ONCE per lossy path, then both hash and store
from those same bytes. Hash-format stable — all 17 parity fixtures
match byte-for-byte.
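The serialize-once idea can be sketched in Python, with `json` and `hashlib` standing in for serde_json and the real hash. Function names mirror the PR's but the bodies are illustrative; in particular the real `canonical_array_json` serializes `&[Value]` in Rust, and the real hash format is not this SHA-256 hex digest.

```python
import hashlib
import json

def canonical_array_json(items):
    """Serialize the item list ONCE into canonical bytes. Those same
    bytes then feed both the content hash and the store write, so the
    array is never serialized (or deep-cloned) a second time."""
    return json.dumps(items, sort_keys=True, separators=(",", ":")).encode()

store = {}  # stand-in for the CCR store

def crush_and_store(items):
    payload = canonical_array_json(items)      # one serialization
    key = hashlib.sha256(payload).hexdigest()  # hash from those bytes
    store[key] = payload                       # store the same bytes
    return key

items = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
key = crush_and_store(items)
# Stored bytes equal hashed bytes by construction: no second
# serialization pass, no Value::Array wrapper, no tree clone.
```

Because hash and store consume identical bytes, the hash format is unchanged as long as the single serialization produces the same bytes as the old double pass, which is what the 17 parity fixtures verify.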
== Tests ==
- 8 ccr.rs unit tests including a new concurrent-stress test (8 threads
× 200 puts/gets, every key readable afterwards)
- 14 ccr_roundtrip integration tests stay green
- parity-run smart_crusher: 17/17 fixtures match
- 479 lib + 14 integration + 185 Python tests all pass
- New benches/ccr_store.rs runs the A/B and is committed for regression
visibility
== Dependencies added ==
- dashmap v6 (mature, widely-used in tokio/linkerd ecosystem)
Summary
Audit-driven perf PR. Three independent hot-path fixes targeting concurrent-request throughput. Each is bench-measured below; the proxy hot path benefits from all three at once.
Stacked on top of #292. Audit notes for the full perf landscape (Tier 1/2/3) are in the commit body — this PR ships only Tier 1 (highest impact, surgical changes).
The three changes
1. PyO3 GIL release on heavy Rust compute
SmartCrusher.crush, smart_crush_content, crush_array_json, compact_document_json, and DiffCompressor.compress/compress_with_stats used to hold the GIL across the entire Rust call. A 100 ms compress() blocked every other Python thread for 100 ms — multi-worker uvicorn deployments serialized through SmartCrusher.
Wrap each compute call in py.allow_threads(|| ...). &str Python inputs are copied to owned String first (PyO3 ties them to the GIL hold).
Verified end-to-end: 4 Python threads × 20 crushes — single-thread baseline 829 ms; 4-thread parallel 826 ms → 4.01× speedup, perfect parallel scaling.
2. CcrStore: Mutex<HashMap> → DashMap-backed sharded store
A single Mutex was the dominant bottleneck under multi-worker load. Replace with DashMap (sharded concurrent map) + a small Mutex<VecDeque> for FIFO eviction order only.
A/B bench in crates/headroom-core/benches/ccr_store.rs (200 mixed put/get × N threads): legacy degrades ~linearly with thread count (one Mutex serializes everything); DashMap stays near-flat per-thread.
3. Single-serialize the lossy CCR payload
The lossy crush_array path was serializing the full array twice — once for the hash, once for the store write — each pass allocating Value::Array(items.to_vec()) and deep-cloning the entire tree. For a 50-item dict array that's megabytes of allocator pressure per call.
Refactor: canonical_array_json serializes &[Value] directly (same bytes, no wrapper allocation, no tree clone); call it ONCE, then hash and store reuse the same bytes. Hash format stable — all 17 parity fixtures still match byte-for-byte.
Test plan
- cargo test --workspace — 479 lib + 14 integration + the rest, all green
- cargo clippy --workspace -- -D warnings clean
- cargo fmt --check clean
- make ci-precheck-python — 185 tests pass
- parity-run smart_crusher — 17/17 fixtures match (proves single-serialize is hash-stable)
- new benches/ccr_store.rs committed for regression visibility
Dependencies added
- dashmap v6 (mature, widely used in tokio/linkerd; sharded concurrent HashMap)
What's NOT in this PR
Tier-2 / Tier-3 items deferred (full audit in PR9 commit body):
- Mutex<TextEmbedding> pool (need to verify ORT internal threadpool first)
- classify_cell + try_parse_json_container
- items.to_vec() even when caller consumes (.compacted instead)