OpenVINO Version
2025.4.0
Operating System
Ubuntu 22.04 (LTS)
Device used for inference
CPU
OpenVINO installation
Docker
Programming Language
C++
Hardware Architecture
x86 (64 bits)
Model used
ResNet50
Model quantization
Yes
Target Platform
- GCP c2-standard-8
- GCP c2-standard-60
Performance issue description
OpenVINO's benchmark_app fails to scale CPU utilization beyond ~23% on a 60-vCPU machine, while achieving 56% on a 4-core machine. This is fully reproducible with the public ResNet-50 model.
Step-by-step reproduction
Environment
| Machine | Phys Cores | vCPUs | Sockets | NUMA Nodes | CPU |
|---|---|---|---|---|---|
| GCP c2-standard-60 | 30 | 60 | 2 | 2 | Intel Xeon @ 3.10GHz |
| GCP c2-standard-8 | 4 | 8 | 1 | 1 | Intel Xeon @ 3.10GHz (same SKU) |
Container: openvino/ubuntu24_dev:2025.4.0 via podman
Model: ResNet-50 v1.7 - 98 MB, 224x224 input
Source: https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.onnx
Reproduction Steps
```sh
wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.onnx

# On a machine with 30+ physical cores:
podman run --rm \
  -v $(pwd):/models:ro \
  openvino/ubuntu24_dev:2025.4.0 \
  benchmark_app -m /models/resnet50-v1-7.onnx -d CPU -niter 200 \
  -shape "[1,3,224,224]" -hint throughput

# Monitor CPU in a separate terminal:
mpstat 1

# Observed: ~14-23% CPU utilization, ~489 FPS (on 60-vCPU machine)
# Expected: >60% CPU utilization
```
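The nstreams and nthreads data below were gathered by re-running benchmark_app with varying flags. A minimal sketch of such a sweep helper is shown here (this wrapper is hypothetical, not part of OpenVINO; the exact per-run flags are an assumption). It only prints the invocations, so the list can be reviewed before piping it to `sh` inside the container:

```shell
# Hypothetical sweep helper (sketch): prints one benchmark_app invocation
# per nstreams value; pipe the output to `sh` inside the container to run it.
MODEL=/models/resnet50-v1-7.onnx
for NS in 1 2 4 8 15 30 60; do
  echo "benchmark_app -m $MODEL -d CPU -niter 200 -nstreams $NS"
done
```

An analogous loop over `-nthreads` and `-b` (batch) produced the other sweeps.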
Data
1. Machine Size Comparison
| Machine | vCPUs | Config | FPS | Latency | CPU % | FPS/vCPU |
|---|---|---|---|---|---|---|
| c2-standard-8 | 8 | hint=throughput | 85.55 | 45.91 ms | 56.4% | 10.69 |
| c2-standard-60 | 60 | hint=throughput | 488.70 | 60.56 ms | 13.9% | 8.15 |
| c2-standard-8 | 8 | hint=latency | 72.62 | 13.59 ms | 42.4% | 9.08 |
| c2-standard-60 | 60 | hint=latency | 158.93 | 6.06 ms | 17.6% | 2.65 |
7.5x more vCPUs yields 5.7x throughput (hint=throughput) and only 2.2x (hint=latency). Per-vCPU efficiency drops 24-71% on the larger machine.
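The FPS/vCPU column and the quoted efficiency drop follow directly from the table; a quick arithmetic check (numbers taken from the rows above):

```shell
# Recompute per-vCPU efficiency (FPS / vCPUs) for both hint modes.
awk 'BEGIN {
  t8  = 85.55  / 8;   t60 = 488.70 / 60;   # hint=throughput
  l8  = 72.62  / 8;   l60 = 158.93 / 60;   # hint=latency
  printf "throughput: %.2f -> %.2f FPS/vCPU (%.0f%% drop)\n", t8, t60, (1 - t60 / t8) * 100
  printf "latency:    %.2f -> %.2f FPS/vCPU (%.0f%% drop)\n", l8, l60, (1 - l60 / l8) * 100
}'
```

This reproduces the 24% (throughput hint) and 71% (latency hint) per-vCPU efficiency drops.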
2. nstreams Sweep (c2-standard-60)
| nstreams | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 158.86 | 6.10 | 17.7% |
| 2 | 315.65 | 6.18 | 23.4% |
| 4 | 391.59 | 10.12 | 11.0% |
| 8 | 382.99 | 20.55 | 10.6% |
| 15 | 458.07 | 30.64 | 10.4% |
| 30 | 489.66 | 60.62 | 17.9% |
| 60 | 490.13 | 60.55 | 18.3% |
CPU stays at 10-23% regardless of stream count. Throughput plateaus at ~490 FPS while latency degrades 10x.
3. nstreams Sweep (c2-standard-8)
| nstreams | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 72.43 | 13.62 | 42.3% |
| 2 | 77.14 | 25.76 | 41.7% |
| 4 | 80.76 | 49.32 | 40.9% |
| 8 | 80.68 | 49.36 | 40.9% |
Utilization is 2-4x higher on the smaller machine for the same stream counts.
4. nthreads Sweep (c2-standard-60, single stream)
| nthreads | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 4 | 64.36 | 15.36 | 7.6% |
| 8 | 115.30 | 8.47 | 11.6% |
| 15 | 155.80 | 6.21 | 17.6% |
| 30 | 186.03 | 5.14 | 29.6% |
| 60 | 185.15 | 5.12 | 29.8% |
Scaling stops at 30 threads (= physical core count). Hyperthreading (60 threads) provides zero improvement.
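A back-of-envelope Amdahl's-law fit is consistent with a per-stream serialization bottleneck. Treating single-stream FPS as a proxy for parallel speedup (an assumption) and using the 4- and 30-thread rows above, the parallel fraction p in S(n) = 1/((1-p) + p/n) can be solved from the observed ratio:

```shell
# Back-of-envelope Amdahl fit from the 4- and 30-thread rows above.
# Assumption: single-stream FPS scales like parallel speedup S(n) = 1/((1-p) + p/n).
awk 'BEGIN {
  r = 186.03 / 64.36;                               # observed FPS(30)/FPS(4)
  # Solve ((1-p) + p/4) = r * ((1-p) + p/30) for the parallel fraction p:
  p = (r - 1) / (r * (1 - 1.0/30) - (1 - 1.0/4));
  printf "parallel fraction p ~= %.2f, single-stream speedup ceiling ~= %.1fx\n", p, 1 / (1 - p)
}'
```

With these numbers the fit gives p ≈ 0.92, i.e. roughly 8% serial work per stream and a single-stream speedup ceiling around 13x, which would explain why adding threads beyond ~30 buys nothing.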
5. nthreads Sweep (c2-standard-8, single stream)
| nthreads | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 19.98 | 49.94 | 13.3% |
| 2 | 38.53 | 25.84 | 24.7% |
| 4 | 72.60 | 13.61 | 42.2% |
| 8 | 72.74 | 13.54 | 42.5% |
Same pattern - scaling stops at physical core count (4). But utilization at peak is 42% vs 30% on the larger machine.
6. Batch Size Impact on CPU Utilization (c2-standard-60, nstreams=2)
| Batch | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 315.32 | 6.13 | 23.4% |
| 4 | 427.06 | 18.45 | 35.9% |
| 8 | 460.81 | 34.35 | 37.9% |
| 16 | 480.40 | 64.57 | 43.8% |
Batching increases CPU utilization (operators have more work per invocation), but throughput gains are modest. Confirms the bottleneck is at the operator parallelism level.
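A quick check of throughput per CPU point supports this: if batching only added useful parallel work, FPS per percent of CPU would stay roughly flat, but it falls (numbers from the batch=1 and batch=16 rows above):

```shell
# Throughput per CPU point from the batch table above.
awk 'BEGIN {
  printf "batch=1:  %.1f FPS per CPU%%\n", 315.32 / 23.4;
  printf "batch=16: %.1f FPS per CPU%%\n", 480.40 / 43.8;
}'
```

CPU utilization nearly doubles while efficiency per CPU point drops from ~13.5 to ~11.0, i.e. the extra cycles are not converting into proportional throughput.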
7. Batch Size (c2-standard-8, nstreams=2)
| Batch | FPS | Median Latency (ms) | CPU % |
|---|---|---|---|
| 1 | 77.24 | 25.71 | 41.8% |
| 4 | 81.93 | 97.42 | 47.4% |
| 8 | 80.06 | 199.01 | 48.6% |
Similar pattern at smaller scale - batching helps CPU utilization but not throughput.
8. NUMA Analysis (c2-standard-60, 2 NUMA nodes)
| Configuration | FPS | Median Latency | Total CPU | Node 0 | Node 1 |
|---|---|---|---|---|---|
| Baseline (nstreams=2) | 316.31 | 6.16 ms | 23.2% | 23.0% | 23.4% |
| numactl --interleave=all | 273.00 | 7.05 ms | 26.2% | 25.0% | 27.3% |
| numactl --cpunodebind=0,1 --membind=0,1 | 310.76 | 6.17 ms | 23.4% | 23.1% | 23.5% |
| Dual instance (1 per node) | 240.98 | N/A | 33.5% | 33.3% | 33.4% |
Both NUMA nodes are equally utilized in all configurations. numactl --interleave=all actually hurts performance (-14% FPS). This rules out NUMA scheduling as a factor.
Root Cause Analysis
We systematically ruled out external causes:
| Hypothesis | Result |
|---|---|
| NUMA imbalance | Ruled out - both nodes equally utilized (23.0% vs 23.4%) |
| OS scheduling | Ruled out - numactl binding has no effect |
| Insufficient threads | Ruled out - 60 threads performs the same as 30 |
| Insufficient concurrency | Ruled out - more streams don't increase CPU |
| Operator granularity | Confirmed - batch size increases CPU (23% -> 44%) without proportional throughput gain |
The evidence points to a runtime-level parallelism ceiling: OpenVINO's thread pool and operator scheduling cannot saturate 30+ cores regardless of configuration.
Even if individual operators can't scale beyond ~15 threads, the runtime should compensate by running more concurrent inference streams. Currently, increasing nstreams from 2 to 60 does not raise CPU utilization at all; it only increases latency.
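To quantify that gap: if streams executed independently, N streams should use roughly N times the single-stream CPU until the machine saturates. A naive ideal-scaling model (an assumption; it ignores shared-cache and memory-bandwidth effects) versus the observed sweep:

```shell
# Naive ideal-scaling model: N independent streams should use about
# N x 17.7% CPU (the single-stream figure on c2-standard-60), capped at 100%.
awk 'BEGIN {
  for (n = 1; n <= 8; n *= 2) {
    ideal = n * 17.7; if (ideal > 100) ideal = 100;
    printf "nstreams=%d: ideal ~%.0f%% CPU\n", n, ideal
  }
}'
```

The model predicts ~35% at 2 streams (observed 23.4%) and ~71% at 4 streams (observed 11.0%), so concurrency collapses rather than compounds as streams are added.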
Impact
- 7.5x more vCPUs yields only 2-6x throughput depending on hint mode
- ~80% of CPU capacity is idle on the 60-vCPU machine
- Running multiple smaller instances would be more cost-effective
Requested Actions
- Investigate thread pool scaling on high-core-count machines - Why doesn't the runtime generate enough parallel work to use available cores?
- Consider adaptive stream count - When operators can't utilize all threads within a single stream, automatically increase concurrent streams
- Profile benchmark_app on 30+ core machines - Identify the serialization point preventing >23% utilization
Issue submission checklist