OpenVINO Version
2025.4.0
Operating System
Ubuntu 22.04 (LTS)
Device used for inference
CPU
OpenVINO installation
Docker
Programming Language
C++
Hardware Architecture
x86 (64 bits)
Model used
ResNet50
Model quantization
Yes
Target Platform
- GCP c2-standard-8
- GCP c2-standard-60
Performance issue description
OpenVINO's benchmark_app fails to scale CPU utilization beyond ~23% on a 60-vCPU machine, while achieving 56% on a 4-core machine. This is fully reproducible with the public ResNet-50 model.
Step-by-step reproduction
Environment
| Machine | Phys Cores | vCPUs | Sockets | NUMA Nodes | CPU |
|---|---|---|---|---|---|
| GCP c2-standard-60 | 30 | 60 | 2 | 2 | Intel Xeon @ 3.10GHz |
| GCP c2-standard-8 | 4 | 8 | 1 | 1 | Intel Xeon @ 3.10GHz (same SKU) |
Container: openvino/ubuntu24_dev:2025.4.0 via podman
Model: ResNet-50 v1.7 - 98 MB, 224x224 input
Source: https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.onnx
Reproduction Steps
```sh
wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.onnx

# On a machine with 30+ physical cores:
podman run --rm \
  -v $(pwd):/models:ro \
  openvino/ubuntu24_dev:2025.4.0 \
  benchmark_app -m /models/resnet50-v1-7.onnx -d CPU -niter 200 \
  -shape "[1,3,224,224]" -hint throughput

# Monitor CPU in a separate terminal:
mpstat 1

# Observed: ~14-23% CPU utilization, ~489 FPS (on 60-vCPU machine)
# Expected: >60% CPU utilization
```
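The nstreams and nthreads data below were gathered by re-running benchmark_app with varying flags. A minimal sketch of such a sweep helper is shown here (this wrapper is hypothetical, not part of OpenVINO; the exact per-run flags are an assumption). It only prints the invocations, so the list can be reviewed before piping it to `sh` inside the container:

```shell
# Hypothetical sweep helper (sketch): prints one benchmark_app invocation
# per nstreams value; pipe the output to `sh` inside the container to run it.
MODEL=/models/resnet50-v1-7.onnx
for NS in 1 2 4 8 15 30 60; do
  echo "benchmark_app -m $MODEL -d CPU -niter 200 -nstreams $NS"
done
```

An analogous loop over `-nthreads` and `-b` (batch) produced the other sweeps.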
Data
1. Machine Size Comparison
| Machine | vCPUs | Config | FPS | Latency | CPU % | FPS/vCPU |
|---|---|---|---|---|---|---|
| c2-standard-8 | 8 | hint=throughput | 85.55 | 45.91 ms | 56.4% | 10.69 |
| c2-standard-60 | 60 | hint=throughput | 488.70 | 60.56 ms | 13.9% | 8.15 |
| c2-standard-8 | 8 | hint=latency | 72.62 | 13.59 ms | 42.4% | 9.08 |
| c2-standard-60 | 60 | hint=latency | 158.93 | 6.06 ms | 17.6% | 2.65 |
7.5x more vCPUs yields 5.7x throughput (hint=throughput) and only 2.2x (hint=latency). Per-vCPU efficiency drops 24-71% on the larger machine.
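The FPS/vCPU column and the quoted efficiency drop follow directly from the table; a quick arithmetic check (numbers taken from the rows above):

```shell
# Recompute per-vCPU efficiency (FPS / vCPUs) for both hint modes.
awk 'BEGIN {
  t8  = 85.55  / 8;   t60 = 488.70 / 60;   # hint=throughput
  l8  = 72.62  / 8;   l60 = 158.93 / 60;   # hint=latency
  printf "throughput: %.2f -> %.2f FPS/vCPU (%.0f%% drop)\n", t8, t60, (1 - t60 / t8) * 100
  printf "latency:    %.2f -> %.2f FPS/vCPU (%.0f%% drop)\n", l8, l60, (1 - l60 / l8) * 100
}'
```

This reproduces the 24% (throughput hint) and 71% (latency hint) per-vCPU efficiency drops.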
2. nstreams Sweep (c2-standard-60)
| nstreams | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 158.86 | 6.10 | 17.7% |
| 2 | 315.65 | 6.18 | 23.4% |
| 4 | 391.59 | 10.12 | 11.0% |
| 8 | 382.99 | 20.55 | 10.6% |
| 15 | 458.07 | 30.64 | 10.4% |
| 30 | 489.66 | 60.62 | 17.9% |
| 60 | 490.13 | 60.55 | 18.3% |
CPU stays at 10-23% regardless of stream count. Throughput plateaus at ~490 FPS while latency degrades 10x.
3. nstreams Sweep (c2-standard-8)
| nstreams | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 72.43 | 13.62 | 42.3% |
| 2 | 77.14 | 25.76 | 41.7% |
| 4 | 80.76 | 49.32 | 40.9% |
| 8 | 80.68 | 49.36 | 40.9% |
Utilization is 2-4x higher on the smaller machine for the same stream counts.
4. nthreads Sweep (c2-standard-60, single stream)
| nthreads | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 4 | 64.36 | 15.36 | 7.6% |
| 8 | 115.30 | 8.47 | 11.6% |
| 15 | 155.80 | 6.21 | 17.6% |
| 30 | 186.03 | 5.14 | 29.6% |
| 60 | 185.15 | 5.12 | 29.8% |
Scaling stops at 30 threads (= physical core count). Hyperthreading (60 threads) provides zero improvement.
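A back-of-envelope Amdahl's-law fit is consistent with a per-stream serialization bottleneck. Treating single-stream FPS as a proxy for parallel speedup (an assumption) and using the 4- and 30-thread rows above, the parallel fraction p in S(n) = 1/((1-p) + p/n) can be solved from the observed ratio:

```shell
# Back-of-envelope Amdahl fit from the 4- and 30-thread rows above.
# Assumption: single-stream FPS scales like parallel speedup S(n) = 1/((1-p) + p/n).
awk 'BEGIN {
  r = 186.03 / 64.36;                               # observed FPS(30)/FPS(4)
  # Solve ((1-p) + p/4) = r * ((1-p) + p/30) for the parallel fraction p:
  p = (r - 1) / (r * (1 - 1.0/30) - (1 - 1.0/4));
  printf "parallel fraction p ~= %.2f, single-stream speedup ceiling ~= %.1fx\n", p, 1 / (1 - p)
}'
```

With these numbers the fit gives p ≈ 0.92, i.e. roughly 8% serial work per stream and a single-stream speedup ceiling around 13x, which would explain why adding threads beyond ~30 buys nothing.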
5. nthreads Sweep (c2-standard-8, single stream)
| nthreads | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 19.98 | 49.94 | 13.3% |
| 2 | 38.53 | 25.84 | 24.7% |
| 4 | 72.60 | 13.61 | 42.2% |
| 8 | 72.74 | 13.54 | 42.5% |
Same pattern - scaling stops at physical core count (4). But utilization at peak is 42% vs 30% on the larger machine.
6. Batch Size Impact on CPU Utilization (c2-standard-60, nstreams=2)
| Batch | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 315.32 | 6.13 | 23.4% |
| 4 | 427.06 | 18.45 | 35.9% |
| 8 | 460.81 | 34.35 | 37.9% |
| 16 | 480.40 | 64.57 | 43.8% |
Batching increases CPU utilization (operators have more work per invocation), but throughput gains are modest. Confirms the bottleneck is at the operator parallelism level.
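A quick check of throughput per CPU point supports this: if batching only added useful parallel work, FPS per percent of CPU would stay roughly flat, but it falls (numbers from the batch=1 and batch=16 rows above):

```shell
# Throughput per CPU point from the batch table above.
awk 'BEGIN {
  printf "batch=1:  %.1f FPS per CPU%%\n", 315.32 / 23.4;
  printf "batch=16: %.1f FPS per CPU%%\n", 480.40 / 43.8;
}'
```

CPU utilization nearly doubles while efficiency per CPU point drops from ~13.5 to ~11.0, i.e. the extra cycles are not converting into proportional throughput.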
7. Batch Size (c2-standard-8, nstreams=2)
| Batch | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 77.24 | 25.71 | 41.8% |
| 4 | 81.93 | 97.42 | 47.4% |
| 8 | 80.06 | 199.01 | 48.6% |
Similar pattern at smaller scale - batching helps CPU utilization but not throughput.
8. NUMA Analysis (c2-standard-60, 2 NUMA nodes)
| Configuration | FPS | Median Latency | Total CPU | Node 0 | Node 1 |
|---|---|---|---|---|---|
| Baseline (nstreams=2) | 316.31 | 6.16 ms | 23.2% | 23.0% | 23.4% |
| numactl --interleave=all | 273.00 | 7.05 ms | 26.2% | 25.0% | 27.3% |
| numactl --cpunodebind=0,1 --membind=0,1 | 310.76 | 6.17 ms | 23.4% | 23.1% | 23.5% |
| Dual instance (1 per node) | 240.98 | N/A | 33.5% | 33.3% | 33.4% |
Both NUMA nodes are equally utilized in all configurations. numactl --interleave=all actually hurts performance (-14% FPS). This rules out NUMA scheduling as a factor.
Root Cause Analysis
We systematically ruled out external causes:
| Hypothesis | Result |
|---|---|
| NUMA imbalance | Ruled out - both nodes equally utilized (23.0% vs 23.4%) |
| OS scheduling | Ruled out - numactl binding has no effect |
| Insufficient threads | Ruled out - 60 threads performs the same as 30 |
| Insufficient concurrency | Ruled out - more streams don't increase CPU |
| Operator granularity | Confirmed - batch size increases CPU (23% -> 44%) without proportional throughput gain |
The evidence points to a runtime-level parallelism ceiling: OpenVINO's thread pool and operator scheduling cannot saturate 30+ cores regardless of configuration.
Even if individual operators can't scale beyond ~15 threads, the runtime should compensate by running more concurrent inference streams. Currently, increasing nstreams from 2 to 60 does not raise CPU utilization at all; it only increases latency.
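To quantify that gap: if streams executed independently, N streams should use roughly N times the single-stream CPU until the machine saturates. A naive ideal-scaling model (an assumption; it ignores shared-cache and memory-bandwidth effects) versus the observed sweep:

```shell
# Naive ideal-scaling model: N independent streams should use about
# N x 17.7% CPU (the single-stream figure on c2-standard-60), capped at 100%.
awk 'BEGIN {
  for (n = 1; n <= 8; n *= 2) {
    ideal = n * 17.7; if (ideal > 100) ideal = 100;
    printf "nstreams=%d: ideal ~%.0f%% CPU\n", n, ideal
  }
}'
```

The model predicts ~35% at 2 streams (observed 23.4%) and ~71% at 4 streams (observed 11.0%), so concurrency collapses rather than compounds as streams are added.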
Impact
- 7.5x more vCPUs yields only 2-6x throughput depending on hint mode
- ~80% of CPU capacity is idle on the 60-vCPU machine
- Running multiple smaller instances would be more cost-effective
Requested Actions
- Investigate thread pool scaling on high-core-count machines - Why doesn't the runtime generate enough parallel work to use available cores?
- Consider adaptive stream count - When operators can't utilize all threads within a single stream, automatically increase concurrent streams
- Profile benchmark_app on 30+ core machines - Identify the serialization point preventing >23% utilization
Issue submission checklist