fix(sdk): resolve exporter deadlock on constrained tokio runtimes #3380

Open
bryantbiggs wants to merge 4 commits into open-telemetry:main from bryantbiggs:fix/blocking-strategy-deadlock

@bryantbiggs

Summary

  • Add BlockingStrategy that captures the tokio runtime handle at construction and enters the runtime context via Handle::enter() before calling futures_executor::block_on() on dedicated background threads
  • Update BatchSpanProcessor, BatchLogProcessor, and PeriodicReader to use BlockingStrategy instead of bare futures_executor::block_on()
  • Fall back to plain futures_executor::block_on() when no tokio runtime is available

Problem

The default thread-based processors call futures_executor::block_on(exporter.export(...)) on their dedicated worker threads. When the exporter uses tonic/gRPC, the export future depends on tokio tasks (e.g. tonic's Buffer worker, spawned via tokio::spawn) that can only be polled by tokio worker threads. If every tokio worker thread is blocked, which happens on a single-threaded runtime (current_thread) or on multi_thread with 1 worker (common in 1-vCPU k8s pods), the result is a circular wait: the processor's worker thread waits for the export to complete, but the export cannot complete because no tokio thread is free to poll the Buffer worker.

Reproduction and detailed analysis in #3356 (comment).

Minimal repro gist: https://gist.github.com/bryantbiggs/62737e105525fe341090d0ad97de2178

| Scenario | force_flush() | shutdown() |
| --- | --- | --- |
| current_thread | Hangs forever | Returns Err(Timeout(5s)), worker thread stays stuck |
| multi_thread(1) + tokio::spawn | Hangs forever (entire runtime freezes) | Same pattern |
| multi_thread(default workers) | Returns immediately | Returns immediately |
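
The circular wait can be modeled with the standard library alone. The sketch below is a toy model, not the SDK or tokio code: `flush_from_runtime_worker`, `FlushOutcome`, and the channel layout are hypothetical names chosen for illustration. It treats the caller as one runtime worker that blocks on the flush, and it uses timeouts so the single-worker case returns instead of hanging.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of a flush attempt in this toy model.
#[derive(Debug, PartialEq)]
enum FlushOutcome {
    Completed,
    TimedOut,
}

/// Models a flush issued from a runtime worker thread.
///
/// `runtime_workers` is the number of simulated runtime worker threads.
/// The caller occupies one of them while it blocks on the flush.
fn flush_from_runtime_worker(runtime_workers: usize) -> FlushOutcome {
    // Processor thread -> runtime: "poll my helper task and reply here".
    let (poll_tx, poll_rx) = mpsc::channel::<mpsc::Sender<()>>();
    // Processor thread -> caller: "export finished".
    let (done_tx, done_rx) = mpsc::channel::<()>();

    // Dedicated processor thread: its "export" cannot finish until a
    // runtime worker has serviced the poll request.
    let processor = thread::spawn(move || {
        let (reply_tx, reply_rx) = mpsc::channel::<()>();
        poll_tx.send(reply_tx).expect("runtime side dropped");
        if reply_rx.recv_timeout(Duration::from_millis(1000)).is_ok() {
            let _ = done_tx.send(());
        }
    });

    // A spare runtime worker exists only when there is more than one worker.
    // Otherwise the receiver stays alive here but is never serviced.
    let mut poll_rx = Some(poll_rx);
    let spare_worker = if runtime_workers > 1 {
        let rx = poll_rx.take().expect("receiver present");
        Some(thread::spawn(move || {
            if let Ok(reply) = rx.recv() {
                let _ = reply.send(());
            }
        }))
    } else {
        None
    };

    // The caller is a runtime worker too, and it is blocked right here.
    let outcome = match done_rx.recv_timeout(Duration::from_millis(500)) {
        Ok(()) => FlushOutcome::Completed,
        Err(_) => FlushOutcome::TimedOut,
    };

    processor.join().expect("processor thread panicked");
    if let Some(worker) = spare_worker {
        worker.join().expect("spare worker panicked");
    }
    outcome
}
```

Calling `flush_from_runtime_worker(1)` times out because the only worker is the blocked caller, while `flush_from_runtime_worker(2)` completes because a spare worker services the request. This mirrors the contrast between the constrained and default-worker rows of the table.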

Solution

BlockingStrategy::new() is called during processor construction (while inside the tokio runtime context). It calls Handle::try_current() to capture the runtime handle. On the dedicated background thread, blocking_strategy.block_on(future) enters the runtime context via Handle::enter() before calling futures_executor::block_on(). This makes tokio facilities (spawn, timers, IO) available on that thread without taking ownership of the reactor: IO continues to be driven by the runtime's own threads.

When no tokio runtime is available (e.g., non-tokio environments), it falls back to plain futures_executor::block_on() — preserving existing behavior.
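
To illustrate why the handle has to be captured at construction, here is a std-only model of the capture-then-enter pattern. It does not use tokio or the SDK: `ToyHandle`, `ToyEnterGuard`, `ToyBlockingStrategy`, and `context_seen_on_worker` are hypothetical stand-ins, and the thread-local slot is only an assumption about how a per-thread runtime context behaves. The point it shows is that a freshly spawned OS thread starts with no context, so the context must be carried over explicitly.

```rust
use std::cell::RefCell;
use std::thread;

/// Toy stand-in for a runtime handle (hypothetical, not tokio's type).
#[derive(Clone, Debug, PartialEq)]
struct ToyHandle {
    name: String,
}

thread_local! {
    /// Per-thread "current runtime" slot; new threads start with `None`.
    static CURRENT: RefCell<Option<ToyHandle>> = RefCell::new(None);
}

/// Restores the previous context when dropped.
struct ToyEnterGuard {
    previous: Option<ToyHandle>,
}

impl Drop for ToyEnterGuard {
    fn drop(&mut self) {
        let previous = self.previous.take();
        CURRENT.with(|slot| *slot.borrow_mut() = previous);
    }
}

impl ToyHandle {
    /// Returns the context of the calling thread, if any.
    fn try_current() -> Option<ToyHandle> {
        CURRENT.with(|slot| slot.borrow().clone())
    }

    /// Makes this handle the calling thread's context until the guard drops.
    fn enter(&self) -> ToyEnterGuard {
        let previous = CURRENT.with(|slot| slot.borrow_mut().replace(self.clone()));
        ToyEnterGuard { previous }
    }
}

/// Captures the context at construction and re-enters it around the work.
struct ToyBlockingStrategy {
    handle: Option<ToyHandle>,
}

impl ToyBlockingStrategy {
    fn new() -> Self {
        Self { handle: ToyHandle::try_current() }
    }

    /// Runs `work` inside the captured context, or directly if none was captured.
    fn run<T>(&self, work: impl FnOnce() -> T) -> T {
        let _guard = self.handle.as_ref().map(ToyHandle::enter);
        work()
    }
}

/// Builds the strategy inside a context, then reports which context a
/// dedicated worker thread observes with and without the strategy.
fn context_seen_on_worker(use_strategy: bool) -> Option<String> {
    let runtime = ToyHandle { name: "rt".to_string() };
    let _guard = runtime.enter();
    let strategy = ToyBlockingStrategy::new();
    let look_up = || ToyHandle::try_current().map(|h| h.name);
    thread::spawn(move || {
        if use_strategy { strategy.run(look_up) } else { look_up() }
    })
    .join()
    .expect("worker thread panicked")
}
```

In this model `context_seen_on_worker(false)` yields `None` and `context_seen_on_worker(true)` yields `Some("rt")`. A strategy built outside any context holds no handle and runs the work directly, which corresponds to the fallback path.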

Scope

This is the scoped-down version of #3356, containing only the bug fix as suggested by @scottgerring in #3356 (comment). The experimental async runtime removal is intentionally excluded and can be discussed separately.

Fixes #2802

Test plan

  • cargo check -p opentelemetry_sdk --all-features
  • cargo check -p opentelemetry_sdk --no-default-features
  • cargo clippy -p opentelemetry_sdk --all-features -- -Dwarnings
  • cargo test -p opentelemetry_sdk --features="testing" — 295 passed, 0 failed, 3 ignored (pre-existing)
  • cargo check -p opentelemetry-otlp --all-features

@codecov

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 89.51613% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.3%. Comparing base (9650783) to head (f9608c4).
⚠️ Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| opentelemetry-sdk/src/metrics/periodic_reader.rs | 82.3% | 6 Missing ⚠️ |
| opentelemetry-sdk/src/logs/batch_log_processor.rs | 88.8% | 4 Missing ⚠️ |
| opentelemetry-sdk/src/trace/span_processor.rs | 92.6% | 3 Missing ⚠️ |
Additional details and impacted files
@@          Coverage Diff           @@
##            main   #3380    +/-   ##
======================================
  Coverage   83.2%   83.3%            
======================================
  Files        128     128            
  Lines      25045   25164   +119     
======================================
+ Hits       20858   20965   +107     
- Misses      4187    4199    +12     

The default thread-based processors (BatchSpanProcessor, BatchLogProcessor,
PeriodicReader) call futures_executor::block_on() on their dedicated worker
threads. When the exporter uses tonic/gRPC, the export future depends on
tokio tasks (e.g. tonic's Buffer worker) that can only be polled by tokio
worker threads. If all tokio worker threads are blocked (single-threaded
runtime, or multi-thread with 1 worker), this creates a circular wait.

Add BlockingStrategy that captures the tokio runtime handle at construction
time and enters the runtime context via Handle::enter() before calling
futures_executor::block_on(). This makes tokio types available on the
dedicated background threads without taking ownership of the reactor.
Falls back to plain futures_executor::block_on() without tokio.

Fixes: open-telemetry#2802

Add tests with TokioSpawn*Exporter mocks that call tokio::spawn()
inside export(), simulating tonic/gRPC exporters. These prove that
BlockingStrategy correctly provides tokio runtime context on the
processor's dedicated OS thread, preventing deadlocks on constrained
multi_thread(1) runtimes (open-telemetry#2802, open-telemetry#3356).
@bryantbiggs bryantbiggs force-pushed the fix/blocking-strategy-deadlock branch from c3840f7 to ed5ff32 Compare March 6, 2026 01:54
Development

Successfully merging this pull request may close these issues.

OTLP MetricExporter deadlock issue

1 participant