Replies: 3 comments
0.9.0 is well over a year old at this point, so I would say updating to 0.10.0 would be a good first step. I don't believe there should be any major breaking changes considering what you've stated about your current setup, beyond how CLI arguments are passed now. The biggest change with regard to throughput would be the addition of the alternative XDP-based packet processor that can be used instead of io-uring. It takes some additional setup due to requiring elevated privileges, and some features, notably XDP_ZEROCOPY, may not be available depending on your NIC/driver. If you still wanted to use io-uring, there was only one substantive change to it since 0.9.0 (as far as I remember), in #1361, but that's not in the 0.10.0 release, so it would be slightly harder to check whether that particular change helps throughput, as I'm assuming it's probably not possible to just cherry-pick it on top of 0.10.0.
I'm very excited to switch to XDP-based packet processing if it's possible, test it in our use case, and share the results in this issue.
@mohbakh I just added a changelog entry for 0.10.0. It has some info. |
Summary
We run Quilkin v0.9.0 as a UDP proxy pool in front of an Agones fleet
(token-routed, 1-byte suffix → upstream GameServer). Throughput per
pod and per core is much lower than we'd expected, while CPU consumption
is high. We'd like to know whether the numbers below are in the expected
range for v0.9.0, and what — short of upgrading — we can reasonably do
about it.
Headline numbers (production, current):
- `strace -c` on one pod: 44% of syscall CPU in `futex` under sustained load (~19k futex calls/sec)

We'd appreciate a sanity check: is this within the expected envelope for v0.9.0 with the configuration below, or are we doing something wrong?
Architecture
Topology
Filter chain
That's it — capture the last byte, route to the matching endpoint.
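For clarity, the routing logic amounts to the following. This is a minimal illustrative sketch with a hypothetical token→endpoint map, not the actual Quilkin filter code:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Illustrative only (not Quilkin source): treat the last byte of each
/// datagram as the routing token and look up the upstream GameServer.
/// Whether the token is stripped before forwarding depends on the Capture
/// filter configuration; here it is stripped for illustration.
fn route<'a>(
    packet: &'a [u8],
    endpoints: &HashMap<u8, SocketAddr>,
) -> Option<(&'a [u8], SocketAddr)> {
    let split = packet.len().checked_sub(1)?;
    let (payload, token) = packet.split_at(split);
    endpoints.get(&token[0]).map(|upstream| (payload, *upstream))
}
```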
Proxy pod spec (relevant bits)
`hostNetwork` is not enabled (we tested it; the effect was small — see below).

Workload shape
Production metrics (right now)
Pulled from Prometheus / our internal dashboards.
Per-pod (averaged across 25 pods)
- `rate(quilkin_packets_total[5m])` (read + write)
- `rate(container_cpu_usage_seconds_total[5m])`
- `quilkin_session_active`
- `container_memory_working_set_bytes`

Per-pod CPU range across the pool
CPU is well above the HPA target (70% of 300m = 210m), which is why the
HPA has scaled the pool to its 25-replica ceiling.
Filter latency (p99 from the read direction)
The capture and token_router filters are both reported below 125 µs
(the smallest histogram bucket) ~99.97% of the time — i.e. the filters
themselves are not where time is being spent.
Drop sources (`quilkin_packets_dropped_total`, by `source` label)

We observe non-trivial drops with `source` showing channel-related causes (`channel full`, `downstream channel full`) under load, not `filter::token_router::no endpoint match`. This is what made us think the bottleneck is the internal pipeline, not the routing or filter work.
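To spell out why we read this as pipeline backpressure rather than misrouting, here is a minimal sketch of the mechanism we assume (our mental model, not Quilkin source): a bounded channel between the socket reader and the processor that drops packets whenever the processor cannot keep up.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Assumed shape of the pipeline: reader -> bounded channel -> processor.
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(1);

    // Deliberately slow consumer so the channel is almost always full.
    tokio::spawn(async move {
        while rx.recv().await.is_some() {
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    });

    // When the channel is full, try_send fails and the packet is lost;
    // this is the shape of a "channel full" drop source.
    let mut dropped = 0u64;
    for _ in 0..1_000 {
        if tx.try_send(vec![0u8; 64]).is_err() {
            dropped += 1;
        }
    }
    println!("dropped {dropped} of 1000 offered packets");
}
```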
Process snapshot
`strace -c -p <pid>` on one pod under steady production load; the top syscalls by time are `futex`, `epoll_wait`, and `recvfrom`/`sendto`.

Isolation experiment (no LB, no Agones, no xDS)
Last week we ran a controlled experiment to rule out the LB and our SDN.
Setup:
Clients send traffic directly to the proxy (no LB in the path), with all packets carrying the same routing token (0x01).

We tried four configurations (referred to below as runs A–D).
Per-pod ceiling is ~5k pps successfully forwarded, at ~10k pps/core,
regardless of filters, worker count, or hostNetwork. Excess offered load
becomes drops, not throughput. RTT under saturation rises into the
50–80 ms range.
Full results (with raw data) are available, and the experiment is reproducible if that would help.
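For reference, the load generator we used is more elaborate, but its shape is roughly the sketch below (addresses, sizes, and rates here are placeholders, not our exact settings):

```rust
use std::net::UdpSocket;
use std::time::{Duration, Instant};

// Illustrative UDP load generator: send small datagrams carrying a 1-byte
// routing token as the last byte at a rough target rate, and count echoed
// responses. Target address, packet size, and rate are placeholders.
fn main() -> std::io::Result<()> {
    let target = "10.0.0.1:7777"; // placeholder: proxy address
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_nonblocking(true)?;

    let mut payload = vec![0u8; 127];
    payload.push(0x01); // routing token as the last byte

    let interval = Duration::from_micros(100); // ~10k pps offered
    let (mut sent, mut received) = (0u64, 0u64);
    let mut buf = [0u8; 2048];

    let start = Instant::now();
    while start.elapsed() < Duration::from_secs(10) {
        if socket.send_to(&payload, target).is_ok() {
            sent += 1;
        }
        while socket.recv_from(&mut buf).is_ok() {
            received += 1;
        }
        std::thread::sleep(interval);
    }
    println!("sent {sent}, received {received} in 10s");
    Ok(())
}
```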
What we already tried
- `--workers 1` vs `--workers 2`. Run B was consistently as good as or better than the production baseline (run A). We have not yet landed `--workers=1` in production but plan to.
- `hostNetwork: true` (run D). ~10% throughput improvement at the saturation point, no order-of-magnitude change.
- Disabling the filter chain. The ceiling was unchanged, so the filters are not the dominant cost.
- `TOKIO_WORKER_THREADS=2`. Without this, on a 64 vCPU node the default Tokio multi-thread runtime spawns ~64 worker threads inside our 1-CPU-quota cgroup, which produced enormous cross-runtime contention. Setting it to 2 lowered futex time materially. We left it at 2 because pinning further (to 1) didn't help and broke the recv/send split. (A sketch of the equivalent runtime setup follows this list.)
- Giving the pod more CPU does not raise throughput (the ceiling is intrinsic to the binary at this load shape — adding cores beyond the runtime's worker count just leaves them idle).
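For anyone hitting the same issue, the programmatic equivalent of what we approximate with the env var looks roughly like this (illustrative only; Quilkin builds its own runtime internally, this is just the shape of the fix):

```rust
// Illustrative only: size the Tokio runtime to the container's CPU quota
// instead of the node's vCPU count. On a 64-vCPU node with a ~1-CPU cgroup
// quota, the default of one worker per logical CPU badly oversubscribes
// the quota and shows up as futex churn.
fn main() {
    let workers = 2; // what we approximate with TOKIO_WORKER_THREADS=2
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers)
        .enable_all()
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        // proxy tasks would run here
    });
}
```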
Questions
- Is ~5k pps per pod (~10k pps per core) the expected range for v0.9.0 with this configuration (Capture + TokenRouter, xDS endpoints, default tokio runtime, pod network)?
- Reading the v0.9.0 source, we noticed the io_uring path uses a `tokio::sync::mpsc::channel::<RecvPacket>(1)` — bounded to 1 — to hand packets from the io_uring OS thread to the async processor (`src/components/proxy/io_uring_shared.rs:280, 660`). On every received packet the io_uring thread `blocking_send`s through that channel; we suspect this is the dominant futex source. Are we reading it correctly? (A small sketch of the pattern we mean follows this list.)
"small packet, single filter chain" workloads we could compare
against? We couldn't find a published reference benchmark.
- Have there been relevant changes to the recv→filter→send pipeline on `main` since v0.9.0? Would upgrading solve this, and if so, which release first contained the change? We'd like to make a single jump rather than chase intermediate versions.
- Is there any tuning we haven't tried that would meaningfully change the per-pod ceiling? We have already tested workers, filters off, and hostNetwork (see above).
Environment
Kubernetes cluster with an Agones fleet (we manage Agones).

We'd be happy to share `cargo flamegraph` / `perf record` output, more detailed metric dumps, or run a specific debug build if it helps narrow this down.
Thanks for the project.