[Performance]: OpenVINO CPU Scaling Bottleneck #35133

@carun

Description

OpenVINO Version

2025.4.0

Operating System

Ubuntu 22.04 (LTS)

Device used for inference

CPU

OpenVINO installation

Docker

Programming Language

C++

Hardware Architecture

x86 (64 bits)

Model used

ResNet50

Model quantization

Yes

Target Platform

  • GCP c2-standard-8
  • GCP c2-standard-60

Performance issue description

OpenVINO's benchmark_app fails to push CPU utilization beyond ~23% on a 60-vCPU machine, while the same workload reaches 56% on a 4-core (8-vCPU) machine. This is fully reproducible with the public ResNet-50 model.

Step-by-step reproduction

Environment

| Machine | Phys Cores | vCPUs | Sockets | NUMA Nodes | CPU |
|---|---|---|---|---|---|
| GCP c2-standard-60 | 30 | 60 | 2 | 2 | Intel Xeon @ 3.10GHz |
| GCP c2-standard-8 | 4 | 8 | 1 | 1 | Intel Xeon @ 3.10GHz (same SKU) |

Container: openvino/ubuntu24_dev:2025.4.0 via podman
Model: ResNet-50 v1.7 - 98 MB, 224x224 input
Source: https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.onnx

Reproduction Steps

```sh
wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.onnx

# On a machine with 30+ physical cores:
podman run --rm \
  -v $(pwd):/models:ro \
  openvino/ubuntu24_dev:2025.4.0 \
  benchmark_app -m /models/resnet50-v1-7.onnx -d CPU -niter 200 \
  -shape "[1,3,224,224]" -hint throughput

# Monitor CPU in a separate terminal:
mpstat 1

# Observed:  ~14-23% CPU utilization, ~489 FPS (on 60-vCPU machine)
# Expected:  >60% CPU utilization
```

Data

1. Machine Size Comparison

| Machine | vCPUs | Config | FPS | Latency | CPU % | FPS/vCPU |
|---|---|---|---|---|---|---|
| c2-standard-8 | 8 | hint=throughput | 85.55 | 45.91 ms | 56.4% | 10.69 |
| c2-standard-60 | 60 | hint=throughput | 488.70 | 60.56 ms | 13.9% | 8.15 |
| c2-standard-8 | 8 | hint=latency | 72.62 | 13.59 ms | 42.4% | 9.08 |
| c2-standard-60 | 60 | hint=latency | 158.93 | 6.06 ms | 17.6% | 2.65 |

7.5x more vCPUs yields only 5.7x throughput (hint=throughput) and 2.2x (hint=latency). Per-vCPU efficiency drops by 24-71% on the larger machine.
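
These ratios can be reproduced directly from Table 1; the following is just a self-contained arithmetic check on the numbers above, not part of the reproduction:

```python
# Sanity-check the scaling ratios quoted above, using the FPS numbers
# copied verbatim from Table 1.
fps = {
    ("c2-8",  "throughput"): 85.55,
    ("c2-60", "throughput"): 488.70,
    ("c2-8",  "latency"):    72.62,
    ("c2-60", "latency"):    158.93,
}
vcpus = {"c2-8": 8, "c2-60": 60}

for hint in ("throughput", "latency"):
    speedup = fps[("c2-60", hint)] / fps[("c2-8", hint)]
    eff_small = fps[("c2-8", hint)] / vcpus["c2-8"]     # FPS per vCPU, small box
    eff_big = fps[("c2-60", hint)] / vcpus["c2-60"]     # FPS per vCPU, big box
    drop = 1 - eff_big / eff_small
    print(f"{hint}: {speedup:.1f}x throughput from 7.5x vCPUs, "
          f"per-vCPU efficiency drop {drop:.0%}")
# throughput: 5.7x throughput from 7.5x vCPUs, per-vCPU efficiency drop 24%
# latency: 2.2x throughput from 7.5x vCPUs, per-vCPU efficiency drop 71%
```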

2. nstreams Sweep (c2-standard-60)

| nstreams | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 158.86 | 6.10 | 17.7% |
| 2 | 315.65 | 6.18 | 23.4% |
| 4 | 391.59 | 10.12 | 11.0% |
| 8 | 382.99 | 20.55 | 10.6% |
| 15 | 458.07 | 30.64 | 10.4% |
| 30 | 489.66 | 60.62 | 17.9% |
| 60 | 490.13 | 60.55 | 18.3% |

CPU stays at 10-23% regardless of stream count. Throughput plateaus at ~490 FPS while latency degrades 10x.
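
To make the plateau concrete, here is a quick check on the Table 2 numbers (again just arithmetic on the reported data):

```python
# nstreams: (fps, median_latency_ms, cpu_pct) from the c2-standard-60 sweep
data = {
    1:  (158.86, 6.10, 17.7),
    2:  (315.65, 6.18, 23.4),
    30: (489.66, 60.62, 17.9),
    60: (490.13, 60.55, 18.3),
}

lat_ratio = data[30][1] / data[1][1]          # latency degradation, 1 -> 30 streams
fps_gain = data[60][0] / data[2][0]           # throughput gain, 2 -> 60 streams
cpu_delta = data[60][2] - data[2][2]          # CPU utilization change, 2 -> 60 streams
print(f"latency degradation: {lat_ratio:.1f}x")
print(f"2 -> 60 streams: {fps_gain:.2f}x FPS, CPU change {cpu_delta:+.1f} pp")
# latency degradation: 9.9x
# 2 -> 60 streams: 1.55x FPS, CPU change -5.1 pp
```

Going from 2 to 60 streams buys 1.55x throughput at a 10x latency cost while CPU utilization actually drops slightly.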

3. nstreams Sweep (c2-standard-8)

| nstreams | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 72.43 | 13.62 | 42.3% |
| 2 | 77.14 | 25.76 | 41.7% |
| 4 | 80.76 | 49.32 | 40.9% |
| 8 | 80.68 | 49.36 | 40.9% |

Utilization is 2-4x higher on the smaller machine for the same stream counts.

4. nthreads Sweep (c2-standard-60, single stream)

| nthreads | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 4 | 64.36 | 15.36 | 7.6% |
| 8 | 115.30 | 8.47 | 11.6% |
| 15 | 155.80 | 6.21 | 17.6% |
| 30 | 186.03 | 5.14 | 29.6% |
| 60 | 185.15 | 5.12 | 29.8% |

Scaling stops at 30 threads (= physical core count). Hyperthreading (60 threads) provides zero improvement.

5. nthreads Sweep (c2-standard-8, single stream)

| nthreads | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 19.98 | 49.94 | 13.3% |
| 2 | 38.53 | 25.84 | 24.7% |
| 4 | 72.60 | 13.61 | 42.2% |
| 8 | 72.74 | 13.54 | 42.5% |

Same pattern: scaling stops at the physical core count (4), but peak utilization reaches 42% vs. 30% on the larger machine.
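
As a rough illustration (not a claim about the actual implementation): fitting Amdahl's law to the c2-standard-8 single-stream data (19.98 FPS at 1 thread, 72.60 FPS at 4 threads) and extrapolating to 30 threads suggests the observed c2-standard-60 result falls well short of what even a modest serial fraction would predict. The cross-machine extrapolation is approximate, but both machines use the same CPU SKU per the environment table:

```python
def amdahl_speedup(serial_frac, n):
    """Amdahl's law: speedup on n threads for a given serial fraction."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

s4 = 72.60 / 19.98                    # measured 4-thread speedup (~3.63x)
serial = (4 / s4 - 1) / (4 - 1)       # solve Amdahl's law for the serial fraction
predicted_fps_30 = 19.98 * amdahl_speedup(serial, 30)

print(f"estimated serial fraction: {serial:.1%}")           # ~3.4%
print(f"Amdahl prediction at 30 threads: {predicted_fps_30:.0f} FPS "
      f"(observed on c2-standard-60: 186 FPS)")
```

A ~3.4% serial fraction would predict roughly 300 FPS at 30 threads; the observed 186 FPS is only ~60% of that, consistent with a ceiling above the operator level.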

6. Batch Size Impact on CPU Utilization (c2-standard-60, nstreams=2)

| Batch | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 315.32 | 6.13 | 23.4% |
| 4 | 427.06 | 18.45 | 35.9% |
| 8 | 460.81 | 34.35 | 37.9% |
| 16 | 480.40 | 64.57 | 43.8% |

Batching increases CPU utilization (operators have more work per invocation), but throughput gains are modest. This confirms the bottleneck sits at the intra-operator parallelism level.
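
The mismatch between utilization and throughput is visible directly in the batch-1 vs. batch-16 rows of the table above:

```python
# batch: (fps, cpu_pct) from the c2-standard-60 nstreams=2 batch sweep
runs = {
    1:  (315.32, 23.4),
    16: (480.40, 43.8),
}
fps_gain = runs[16][0] / runs[1][0]
cpu_gain = runs[16][1] / runs[1][1]
print(f"batch 1 -> 16: {cpu_gain:.2f}x CPU utilization "
      f"for only {fps_gain:.2f}x throughput")
# batch 1 -> 16: 1.87x CPU utilization for only 1.52x throughput
```

The extra cycles are being spent, but not converted into proportional throughput.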

7. Batch Size (c2-standard-8, nstreams=2)

| Batch | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 77.24 | 25.71 | 41.8% |
| 4 | 81.93 | 97.42 | 47.4% |
| 8 | 80.06 | 199.01 | 48.6% |

Similar pattern at smaller scale - batching helps CPU utilization but not throughput.

8. NUMA Analysis (c2-standard-60, 2 NUMA nodes)

| Configuration | FPS | Median Latency | Total CPU | Node 0 | Node 1 |
|---|---|---|---|---|---|
| Baseline (nstreams=2) | 316.31 | 6.16 ms | 23.2% | 23.0% | 23.4% |
| numactl --interleave=all | 273.00 | 7.05 ms | 26.2% | 25.0% | 27.3% |
| numactl --cpunodebind=0,1 --membind=0,1 | 310.76 | 6.17 ms | 23.4% | 23.1% | 23.5% |
| Dual instance (1 per node) | 240.98 | N/A | 33.5% | 33.3% | 33.4% |

Both NUMA nodes are equally utilized in all configurations. numactl --interleave=all actually hurts performance (-14% FPS). This rules out NUMA scheduling as a factor.
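
The -14% figure and the node balance follow directly from the table above:

```python
# NUMA runs from Table 8 on c2-standard-60
baseline_fps, interleave_fps = 316.31, 273.00
penalty = 1 - interleave_fps / baseline_fps      # cost of --interleave=all
node0, node1 = 23.0, 23.4                        # baseline per-node CPU %

print(f"interleave penalty: {penalty:.1%}")                      # ~13.7%
print(f"node imbalance: {abs(node0 - node1):.1f} percentage points")
```

A sub-half-point node imbalance with a measurable penalty from forced interleaving is the opposite of what a NUMA-placement problem would look like.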

Root Cause Analysis

We systematically ruled out external causes:

| Hypothesis | Result |
|---|---|
| NUMA imbalance | Ruled out - both nodes equally utilized (23.0% vs 23.4%) |
| OS scheduling | Ruled out - numactl binding has no effect |
| Insufficient threads | Ruled out - 60 threads performs the same as 30 |
| Insufficient concurrency | Ruled out - more streams don't increase CPU |
| Operator granularity | Confirmed - batch size increases CPU (23% -> 44%) without proportional throughput gain |

The evidence points to a runtime-level parallelism ceiling: OpenVINO's thread pool and operator scheduling cannot saturate 30+ cores regardless of configuration.

Even if individual operators can't scale beyond ~15 threads, the runtime should compensate by running more concurrent inference streams. Currently, increasing nstreams from 2 to 60 adds no CPU utilization - it only increases latency.

Impact

  • 7.5x more vCPUs yields only 2-6x throughput depending on hint mode
  • ~80% of CPU capacity is idle on the 60-vCPU machine
  • Running multiple smaller instances would be more cost-effective

Requested Actions

  1. Investigate thread pool scaling on high-core-count machines - Why doesn't the runtime generate enough parallel work to use available cores?
  2. Consider adaptive stream count - When operators can't utilize all threads within a single stream, automatically increase concurrent streams
  3. Profile benchmark_app on 30+ core machines - Identify the serialization point preventing >23% utilization

Issue submission checklist

  • I'm reporting a performance issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
