
perf(rust): tier-1 multi-worker wins — GIL release, sharded CCR store, single-serialize CCR write#293

Merged
chopratejas merged 1 commit into main from rust-perf-tier1-gil-dashmap-single-serialize
Apr 28, 2026

Conversation

@chopratejas (Owner)

Summary

Audit-driven perf PR. Three independent hot-path fixes targeting concurrent-request throughput. Each is bench-measured below; the proxy hot path benefits from all three at once.

Stacked on top of #292. Audit notes for the full perf landscape (Tier 1/2/3) are in the commit body — this PR ships only Tier 1 (highest impact, surgical changes).

The three changes

1. PyO3 GIL release on heavy Rust compute

SmartCrusher.crush, smart_crush_content, crush_array_json, compact_document_json, and DiffCompressor.compress/compress_with_stats used to hold the GIL across the entire Rust call. A 100 ms compress() blocked every other Python thread for 100 ms — multi-worker uvicorn deployments serialized through SmartCrusher.

Fix: wrap each compute call in py.allow_threads(|| ...). &str inputs from Python are copied to an owned String first, because PyO3 ties those borrows to the GIL being held.

Verified end-to-end with 4 Python threads × 20 crushes each: the single-thread baseline takes 829 ms per 20-crush batch, and 4 threads in parallel finish in 826 ms wall clock. That is 4× the work in the same time, a 4.01× effective speedup and essentially perfect parallel scaling.
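The shape of the change can be sketched with stdlib-only stand-ins (the real code uses pyo3's `py.allow_threads`; `allow_threads` and `crush_inner` below are illustrative, not the crate's actual API):

```rust
use std::thread;

// Stand-in for pyo3's Python::allow_threads: in real pyo3 this releases
// the GIL, runs the closure, then re-acquires the GIL before returning.
fn allow_threads<R: Send>(f: impl FnOnce() -> R + Send) -> R {
    f()
}

// Heavy pure-Rust compute that must not run while holding the GIL.
fn crush_inner(input: &str) -> usize {
    input.bytes().filter(|b| !b.is_ascii_whitespace()).count()
}

fn crush(py_input: &str) -> usize {
    // Copy the GIL-tied &str to an owned String BEFORE releasing the GIL:
    // the borrow is only valid while the GIL is held.
    let owned: String = py_input.to_string();
    allow_threads(move || crush_inner(&owned))
}

fn main() {
    // With the GIL released, threads doing heavy compute run in parallel.
    let handles: Vec<_> = (0..4).map(|_| thread::spawn(|| crush("ab cd"))).collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 4);
    }
    println!("ok");
}
```

The owned-`String` copy is the one subtlety: moving the closure off the GIL means it cannot capture any borrow whose lifetime is tied to holding the GIL.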

2. CcrStore: Mutex<HashMap> → DashMap-backed sharded store

Single Mutex was the dominant bottleneck under multi-worker load. Replace with DashMap (sharded concurrent map) + a small Mutex<VecDeque> for FIFO eviction order only.

A/B bench in crates/headroom-core/benches/ccr_store.rs (200 mixed put/get × N threads):

| Threads | DashMap | Legacy Mutex | Speedup |
|--------:|--------:|-------------:|--------:|
| 1 | 63 µs | 71 µs | 1.1× |
| 2 | 98 µs | 194 µs | 2.0× |
| 4 | 178 µs | 707 µs | 4.0× |
| 8 | 342 µs | 1267 µs | 3.7× |

Legacy degrades ~linearly with thread count (one Mutex serializes everything); DashMap stays near-flat per-thread.
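The idea behind the sharded map can be shown with a minimal stdlib-only sketch: N independent Mutex<HashMap> shards selected by key hash, so threads touching different shards never contend. (DashMap does this internally with finer-grained locking; `ShardedMap` below is illustrative, not its API.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};

struct ShardedMap {
    shards: Vec<Mutex<HashMap<String, String>>>,
}

impl ShardedMap {
    fn new(n: usize) -> Self {
        Self { shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    // Pick a shard by key hash: distinct keys usually hit distinct locks.
    fn shard(&self, key: &str) -> &Mutex<HashMap<String, String>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn put(&self, key: &str, val: &str) {
        self.shard(key).lock().unwrap().insert(key.into(), val.into());
    }

    fn get(&self, key: &str) -> Option<String> {
        self.shard(key).lock().unwrap().get(key).cloned()
    }
}

fn main() {
    // Mirror the bench shape: 8 threads × 200 puts/gets, every key readable.
    let m = Arc::new(ShardedMap::new(16));
    let handles: Vec<_> = (0..8)
        .map(|t| {
            let m = Arc::clone(&m);
            std::thread::spawn(move || {
                for i in 0..200 {
                    let key = format!("ccr:{t}:{i}");
                    m.put(&key, "payload");
                    assert_eq!(m.get(&key).as_deref(), Some("payload"));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("ok");
}
```

With one global Mutex, every one of the 1600 operations above would serialize through the same lock; with 16 shards, contention only occurs when two threads happen to hash to the same shard.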

3. Single-serialize the lossy CCR payload

The lossy crush_array path was serializing the full array twice — once for the hash, once for the store write — each one allocating Value::Array(items.to_vec()) and deep-cloning the entire tree. For a 50-item dict array that's MBs of allocator pressure per call.

Refactor: canonical_array_json serializes &[Value] directly (same bytes, no wrapper allocation, no tree clone). It is called once, and both the hash and the store write reuse the same bytes. The hash format is stable: all 17 parity fixtures still match byte-for-byte.
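The serialize-once pattern, sketched with stdlib-only stand-ins (the real `canonical_array_json` serializes `&[serde_json::Value]` directly; here pre-serialized element strings stand in for Values, and `DefaultHasher` stands in for the real hash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Build the canonical array bytes straight from the slice: no
// Value::Array(items.to_vec()) wrapper, no deep clone of the tree.
fn canonical_array_json(items: &[String]) -> String {
    format!("[{}]", items.join(","))
}

fn hash_bytes(bytes: &str) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn main() {
    let items = vec![r#"{"id":1}"#.to_string(), r#"{"id":2}"#.to_string()];

    // Serialize ONCE...
    let canonical = canonical_array_json(&items);

    // ...then both the hash and the store write reuse the same bytes.
    let key = hash_bytes(&canonical);
    let stored = canonical.clone();

    assert_eq!(canonical, r#"[{"id":1},{"id":2}]"#);
    assert_eq!(hash_bytes(&stored), key); // hash-stable: same bytes, same key
    println!("ok");
}
```

The point is that the hash and the stored payload are derived from one buffer, so the output bytes (and therefore the hash format) are unchanged while the second serialization and the tree clone disappear.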

Test plan

  • cargo test --workspace: 479 lib + 14 integration + remaining tests, all green
  • cargo clippy --workspace -- -D warnings: clean
  • cargo fmt --check: clean
  • make ci-precheck-python: 185 tests pass
  • New ccr.rs concurrent-stress test (8 threads × 200 puts/gets, every key readable)
  • parity-run smart_crusher — 17/17 fixtures match (proves single-serialize is hash-stable)
  • End-to-end Python GIL-release verification (script in commit body)
  • Bench A/B for store (committed at benches/ccr_store.rs for regression visibility)

Dependencies added

  • dashmap v6 (mature, widely used in tokio/linkerd; sharded concurrent HashMap)

What's NOT in this PR

Tier-2 / Tier-3 items deferred (full audit in PR9 commit body):

  • EmbeddingScorer's Mutex<TextEmbedding> pool (need to verify ORT internal threadpool first)
  • Redundant JSON parse in classify_cell + try_parse_json_container
  • Lossless wins still allocate items.to_vec() even when caller consumes .compacted instead
  • simd-json on the proxy hot path (when transforms get wired into the proxy)
  • TCP nodelay / keepalive on reqwest

…, single-serialize CCR write

Three orthogonal hot-path fixes targeting concurrent-request throughput.
Each is independently bench-measured below; the proxy hot path benefits
from all three at once.

== 1. PyO3 GIL release on heavy compute ==

PyO3 methods (crush, smart_crush_content, crush_array_json,
compact_document_json, compress, compress_with_stats) used to hold the
GIL across the entire Rust call. Result: a 100ms compress() blocked
EVERY other Python thread for 100ms — multi-worker uvicorn deployments
serialized through SmartCrusher.

Wrap each compute call in `py.allow_threads(|| ...)`. Inputs (`&str`
from Python) are copied to owned `String` first because PyO3 ties them
to the GIL hold. PyDict construction stays on the GIL side.

Measured: 4 Python threads each running 20 crushes:
  before (GIL held): ~3.3s wall    (serialized — equivalent to 4×0.83s)
  after (allow_threads): 826ms wall (4.01x speedup, perfect parallel)

== 2. CcrStore: Mutex<HashMap> -> DashMap-backed sharded ==

Single Mutex was the dominant bottleneck under multi-worker load — every
put/get serialized through one lock. Replace with DashMap (sharded
concurrent map, lock-free reads within a shard) plus a separate
small Mutex<VecDeque> for FIFO insertion-order eviction. Reads of
distinct keys never contend; writes only contend during the brief
order-queue push or capacity-sweep.
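A stdlib-only sketch of the map-plus-order-queue split (Mutex<HashMap> stands in for DashMap; field and type names are illustrative):

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

struct Store {
    map: Mutex<HashMap<String, String>>, // DashMap in the real code
    order: Mutex<VecDeque<String>>,      // small lock: FIFO insertion order only
    capacity: usize,
}

impl Store {
    fn put(&self, key: &str, val: &str) {
        self.map.lock().unwrap().insert(key.into(), val.into());
        // Brief contention window: push onto the order queue, then sweep
        // oldest entries if over capacity.
        let mut order = self.order.lock().unwrap();
        order.push_back(key.to_string());
        while order.len() > self.capacity {
            if let Some(oldest) = order.pop_front() {
                self.map.lock().unwrap().remove(&oldest);
            }
        }
    }

    fn get(&self, key: &str) -> Option<String> {
        // Reads never touch the order queue.
        self.map.lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let s = Store {
        map: Mutex::new(HashMap::new()),
        order: Mutex::new(VecDeque::new()),
        capacity: 2,
    };
    s.put("a", "1");
    s.put("b", "2");
    s.put("c", "3"); // capacity sweep evicts "a" (oldest)
    assert_eq!(s.get("a"), None);
    assert_eq!(s.get("c").as_deref(), Some("3"));
    println!("ok");
}
```

Keeping eviction order in its own small Mutex means gets stay on the sharded map's fast path; only puts briefly touch the queue.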

A/B bench (200 mixed put/get ops × N threads, in benches/ccr_store.rs):
  Threads | DashMap | Legacy Mutex | Speedup
  --------|---------|--------------|--------
        1 |   63 µs |        71 µs |   1.13x
        2 |   98 µs |       194 µs |   2.0x
        4 |  178 µs |       707 µs |   4.0x
        8 |  342 µs |      1267 µs |   3.7x

Legacy degrades ~linearly with thread count; DashMap stays near-flat
per-thread. Real multi-worker scaling.

== 3. Single-serialize the lossy CCR payload ==

The lossy `crush_array` path used to serialize the full array TWICE:
once in `hash_array_for_ccr` (allocates `Value::Array(items.to_vec())`,
deep-clones every Value subtree, then serializes), and a second time
in the store-write site. For a 50-item dict array that's ~MB of
allocator pressure per crushed array.

Introduce `canonical_array_json` (serializes `&[Value]` directly — same
bytes as `Value::Array(items.to_vec())` but no wrapper allocation +
no tree clone), call it ONCE per lossy path, then both hash and store
from those same bytes. Hash-format stable — all 17 parity fixtures
match byte-for-byte.

== Tests ==

- 8 ccr.rs unit tests including a new concurrent-stress test (8 threads
  × 200 puts/gets, every key readable afterwards)
- 14 ccr_roundtrip integration tests stay green
- parity-run smart_crusher: 17/17 fixtures match
- 479 lib + 14 integration + 185 Python tests all pass
- New benches/ccr_store.rs runs the A/B and is committed for regression
  visibility

== Dependencies added ==

- dashmap v6  (mature, widely-used in tokio/linkerd ecosystem)
@chopratejas chopratejas force-pushed the rust-perf-tier1-gil-dashmap-single-serialize branch from ab92681 to 29aadb1 Compare April 28, 2026 05:27
@chopratejas chopratejas merged commit f81b88b into main Apr 28, 2026
25 checks passed
