Replies: 3 comments
0.9.0 is well over a year old at this point, so I would say updating to 0.10.0 would be a good first step. I don't believe there should be any major breaking changes considering what you've stated about your current setup, beyond how CLI arguments are passed now. The biggest change with regard to throughput would be the addition of the alternative XDP-based packet processor that can be used instead of io-uring. It takes some additional setup due to requiring elevated privileges, and some features, notably XDP_ZEROCOPY, may not be available depending on your NIC/driver. If you still wanted to use io-uring, there was only one substantive change to it since 0.9.0 (as far as I remember), in #1361, but that's not in the 0.10.0 release, so it would be slightly harder to check whether that particular change helps throughput, as I'm assuming it's probably not possible to just cherry-pick it on top of 0.10.0.
I'm very excited to switch to XDP-based packet processing if it's possible, test it in our use case, and share the results in this issue.
@mohbakh I just added a changelog entry for 0.10.0. It has some info. |
Summary
We run Quilkin v0.9.0 as a UDP proxy pool in front of an Agones fleet
(token-routed, 1-byte suffix → upstream GameServer). Throughput per
pod and per core is much lower than we'd expected, while CPU consumption
is high. We'd like to know whether the numbers below are in the expected
range for v0.9.0, and what — short of upgrading — we can reasonably do
about it.
Headline numbers (production, current):
- `strace -c` on one pod: 44% of syscall CPU in `futex` under sustained load (~19k futex calls/sec)

We'd appreciate a sanity check: is this within the expected envelope for v0.9.0 with the configuration below, or are we doing something wrong?
Architecture
Topology
Filter chain
That's it — capture the last byte, route to the matching endpoint.
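For clarity, the routing logic amounts to the following. This is a minimal illustrative sketch with a hypothetical token→endpoint map, not the actual Quilkin filter code:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Illustrative only (not Quilkin source): treat the last byte of each
/// datagram as the routing token and look up the upstream GameServer.
/// Whether the token is stripped before forwarding depends on the Capture
/// filter configuration; here it is stripped for illustration.
fn route<'a>(
    packet: &'a [u8],
    endpoints: &HashMap<u8, SocketAddr>,
) -> Option<(&'a [u8], SocketAddr)> {
    let split = packet.len().checked_sub(1)?;
    let (payload, token) = packet.split_at(split);
    endpoints.get(&token[0]).map(|upstream| (payload, *upstream))
}
```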
Proxy pod spec (relevant bits)
`hostNetwork` is not enabled (we tested it; the effect was small — see below).

Workload shape
Production metrics (right now)
Pulled from Prometheus / our internal dashboards.
Per-pod (averaged across 25 pods)
- `rate(quilkin_packets_total[5m])` (read + write)
- `rate(container_cpu_usage_seconds_total[5m])`
- `quilkin_session_active`
- `container_memory_working_set_bytes`

Per-pod CPU range across the pool
CPU is well above the HPA target (70% of 300m = 210m), which is why the
HPA has scaled the pool to its 25-replica ceiling.
Filter latency (p99 from the read direction)
The capture and token_router filters are both reported below 125 µs
(the smallest histogram bucket) ~99.97% of the time — i.e. the filters
themselves are not where time is being spent.
Drop sources (`quilkin_packets_dropped_total`, by `source` label)

We observe non-trivial drops with `source` showing channel-related causes (`channel full`, `downstream channel full`) under load, not `filter::token_router::no endpoint match`. This is what made us think the bottleneck is the internal pipeline, not the routing or filter work.
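To spell out why we read this as pipeline backpressure rather than misrouting, here is a minimal sketch of the mechanism we assume (our mental model, not Quilkin source): a bounded channel between the socket reader and the processor that drops packets whenever the processor cannot keep up.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Assumed shape of the pipeline: reader -> bounded channel -> processor.
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(1);

    // Deliberately slow consumer so the channel is almost always full.
    tokio::spawn(async move {
        while rx.recv().await.is_some() {
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    });

    // When the channel is full, try_send fails and the packet is lost;
    // this is the shape of a "channel full" drop source.
    let mut dropped = 0u64;
    for _ in 0..1_000 {
        if tx.try_send(vec![0u8; 64]).is_err() {
            dropped += 1;
        }
    }
    println!("dropped {dropped} of 1000 offered packets");
}
```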
Process snapshot
`strace -c -p <pid>` on one pod under steady production load; the top syscalls by time are `futex`, `epoll_wait`, and `recvfrom`/`sendto`.

Isolation experiment (no LB, no Agones, no xDS)
Last week we ran a controlled experiment to rule out the LB and our SDN.
Setup:
Clients send traffic directly to the proxy (no LB in the path), with all packets carrying the same routing token (0x01).

We tried four configurations (referred to below as runs A–D).
Per-pod ceiling is ~5k pps successfully forwarded, at ~10k pps/core,
regardless of filters, worker count, or hostNetwork. Excess offered load
becomes drops, not throughput. RTT under saturation rises into the
50–80 ms range.
Full results (with raw data) are available, and the experiment is reproducible if that would help.
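For reference, the load generator we used is more elaborate, but its shape is roughly the sketch below (addresses, sizes, and rates here are placeholders, not our exact settings):

```rust
use std::net::UdpSocket;
use std::time::{Duration, Instant};

// Illustrative UDP load generator: send small datagrams carrying a 1-byte
// routing token as the last byte at a rough target rate, and count echoed
// responses. Target address, packet size, and rate are placeholders.
fn main() -> std::io::Result<()> {
    let target = "10.0.0.1:7777"; // placeholder: proxy address
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_nonblocking(true)?;

    let mut payload = vec![0u8; 127];
    payload.push(0x01); // routing token as the last byte

    let interval = Duration::from_micros(100); // ~10k pps offered
    let (mut sent, mut received) = (0u64, 0u64);
    let mut buf = [0u8; 2048];

    let start = Instant::now();
    while start.elapsed() < Duration::from_secs(10) {
        if socket.send_to(&payload, target).is_ok() {
            sent += 1;
        }
        while socket.recv_from(&mut buf).is_ok() {
            received += 1;
        }
        std::thread::sleep(interval);
    }
    println!("sent {sent}, received {received} in 10s");
    Ok(())
}
```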
What we already tried
- `--workers 1` vs `--workers 2`. Run B was consistently as good as or better than the production baseline (run A). We have not yet landed `--workers=1` in production but plan to.
- `hostNetwork: true` (run D). ~10% throughput improvement at the saturation point, no order-of-magnitude change.
- Disabling the filter chain. The ceiling was unchanged, so the filters are not the dominant cost.
- `TOKIO_WORKER_THREADS=2`. Without this, on a 64 vCPU node the default Tokio multi-thread runtime spawns ~64 worker threads inside our 1-CPU-quota cgroup, which produced enormous cross-runtime contention. Setting it to 2 lowered futex time materially. We left it at 2 because pinning further (to 1) didn't help and broke the recv/send split. (A sketch of the equivalent runtime setup follows this list.)
- Giving the pod more CPU does not raise throughput (the ceiling is intrinsic to the binary at this load shape — adding cores beyond the runtime's worker count just leaves them idle).
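For anyone hitting the same issue, the programmatic equivalent of what we approximate with the env var looks roughly like this (illustrative only; Quilkin builds its own runtime internally, this is just the shape of the fix):

```rust
// Illustrative only: size the Tokio runtime to the container's CPU quota
// instead of the node's vCPU count. On a 64-vCPU node with a ~1-CPU cgroup
// quota, the default of one worker per logical CPU badly oversubscribes
// the quota and shows up as futex churn.
fn main() {
    let workers = 2; // what we approximate with TOKIO_WORKER_THREADS=2
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers)
        .enable_all()
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        // proxy tasks would run here
    });
}
```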
Questions
- Is ~5k pps per pod (~10k pps per core) the expected range for v0.9.0 with this configuration (Capture + TokenRouter, xDS endpoints, default tokio runtime, pod network)?
- Reading the v0.9.0 source, we noticed the io_uring path uses a `tokio::sync::mpsc::channel::<RecvPacket>(1)` — bounded to 1 — to hand packets from the io_uring OS thread to the async processor (`src/components/proxy/io_uring_shared.rs:280, 660`). On every received packet the io_uring thread `blocking_send`s through that channel; we suspect this is the dominant futex source. Are we reading it correctly? (A small sketch of the pattern we mean follows this list.)
"small packet, single filter chain" workloads we could compare
against? We couldn't find a published reference benchmark.
- Have there been relevant changes to the recv→filter→send pipeline on `main` since v0.9.0? Would upgrading solve this, and if so, which release first contained the change? We'd like to make a single jump rather than chase intermediate versions.
- Is there any tuning we haven't tried that would meaningfully change the per-pod ceiling? We have already tested workers, filters off, and hostNetwork (see above).
Environment
Kubernetes cluster with an Agones fleet (we manage Agones).

We'd be happy to share `cargo flamegraph` / `perf record` output, more detailed metric dumps, or run a specific debug build if it helps narrow this down.
Thanks for the project.