
ferro_io

Status: Discontinued (April 2026)

This project is no longer under active development.

ferro_io was an experiment in shipping a Tokio-backed drop-in asyncio replacement for Python. Empirical benchmarking against real-world Python workloads showed that the project, as framed, does not solve the problem it claimed to solve. The code is left public as a reference implementation of PyO3 0.28 + pyo3-async-runtimes 0.28 + Tokio bridging, but do not use it in production and do not expect updates.

See Findings below for the full write-up of what was tested and why the premise does not hold on current CPython.

If you want to continue the architectural bet (multi-thread async runtime positioned for PEP 703 free-threaded Python), fork freely — the MIT license stands.


Findings

Original claim

"CPython's GIL serializes asyncio onto one core. ferro_io moves the scheduler into Tokio with a multi-thread runtime, so CPU workloads bypass the GIL and IO workloads stay at the theoretical sleep floor."

The headline benchmark reported a 947× speedup over stdlib on a "14 CPU chains × 5M LCG iters" workload.

What the benchmark actually measured

The 947× number comes from AsyncRuntime.map_blocking, whose signature is fixed to (items: Vec<u64>, iterations: u64) -> Vec<u64> (see src/lib.rs lines 118–146). The "Python" side of the comparison is a Python LCG loop; the "ferro_io" side is a Rust LCG loop run via tokio::task::spawn_blocking.

This measures Rust arithmetic vs Python arithmetic. It does not measure runtime-vs-runtime. Any Rust extension beats Python arithmetic by a similar factor; this is a property of Rust, not of ferro_io.

Users cannot plug their own Rust kernels into map_blocking — the API is locked to u64 in, u64 out. So the headline win is not accessible from arbitrary Python code.
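
To make the shape of that comparison concrete, here is a hedged reconstruction of the two sides. Only the Rust signature above comes from src/lib.rs; the LCG constants and the exact Python call shape below are assumptions for illustration.

# Hypothetical reconstruction of the 947× benchmark's two sides.
def lcg_chain(seed: int, iterations: int) -> int:
    # Knuth's MMIX LCG constants as a stand-in; the repo's exact constants are not shown here.
    x = seed
    for _ in range(iterations):
        x = (x * 6364136223846793005 + 1442695040888963407) & 0xFFFFFFFFFFFFFFFF
    return x

# "Python" side: 14 chains × 5M iterations executed as bytecode under the GIL.
python_results = [lcg_chain(seed, 5_000_000) for seed in range(14)]

# "ferro_io" side (call shape inferred from the Rust signature, not a documented API example):
# the same arithmetic, but compiled Rust running inside tokio::task::spawn_blocking.
#   rt = ferro_io.AsyncRuntime(worker_threads=8)
#   rust_results = rt.map_blocking(list(range(14)), 5_000_000)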

Real-world benchmark (bcrypt / Pillow / ReportLab / WeasyPrint)

The benchmark exercised a real application's run_in_cpu_thread hot paths: bcrypt password hashing, Pillow image resizing, ReportLab PDF generation, and WeasyPrint HTML→PDF. Each run compared the stdlib asyncio default executor against ferro_io.install() + AsyncRuntime(worker_threads=8), in separate processes so that install()'s global sys.modules patch could not contaminate the baseline. Serial runs were best-of-5 after a warmup; concurrent runs used asyncio.gather at N = 20, 100, 300, 1000.
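
A minimal sketch of the stdlib side of the concurrent harness, assuming asyncio.to_thread as the default-executor path and the bcrypt cost-10 workload; per the methodology above, the ferro_io side differed only in calling ferro_io.install() and constructing AsyncRuntime(worker_threads=8) in its own process.

import asyncio
import time

import bcrypt

async def hash_once() -> bytes:
    # Runs on the default executor; bcrypt releases the GIL inside its C code.
    return await asyncio.to_thread(bcrypt.hashpw, b"password", bcrypt.gensalt(rounds=10))

async def bench(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(hash_once() for _ in range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (20, 100, 300, 1000):
        print(f"N={n}: {asyncio.run(bench(n)):.2f}s")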

Workload | Serial latency (ferro_io vs stdlib) | Concurrent throughput (ferro_io vs stdlib)
---|---|---
bcrypt cost-10 | tied (< 1 ms) | tied across all N
Pillow resize 2000→800 | tied | tied after warmup (cold-pool artifact on first batch)
ReportLab invoice PDF | tied | tied
WeasyPrint MITC PDF | tied | tied; both runtimes segfault stochastically at N ≥ 100 (Pango/cairo not thread-safe — a C-stack issue no Python runtime can fix)

There was no measurable throughput or latency win. The one gap that looked like a ferro_io win (the first Pillow batch, ~65% faster) was pre-warmed-pool behavior that disappears on subsequent runs; a prestart_executor_threads call on stdlib closes it for free.
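
The repo's prestart_executor_threads helper is not reproduced in this README; the following is a sketch of the equivalent stdlib-side fix, assuming a plain ThreadPoolExecutor installed as the default executor.

import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

def prestart_executor_threads(loop: asyncio.AbstractEventLoop, workers: int = 8) -> None:
    # Install a fixed-size pool and block until every worker thread exists,
    # so the first real batch does not pay the thread-startup cost measured above.
    pool = ThreadPoolExecutor(max_workers=workers)
    loop.set_default_executor(pool)
    barrier = threading.Barrier(workers)
    for fut in [pool.submit(barrier.wait) for _ in range(workers)]:
        fut.result()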

Why the premise does not hold on current CPython

  1. The GIL is a CPython interpreter invariant. Any library that runs Python bytecode in threads hits it. Moving the scheduler to Rust/Tokio does not remove this constraint — the scheduler is microseconds; the bytecode is milliseconds. Wrong layer of attack.
  2. C extensions that release the GIL already parallelize under stdlib's ThreadPoolExecutor. bcrypt, Pillow (libjpeg), zstd, orjson all drop the GIL in their C code. A different scheduler doesn't help them — the thread pool is the parallelism mechanism, not the loop (a sketch after this list illustrates the effect).
  3. Pure-Python CPU work stays GIL-bound regardless of runtime. ReportLab is the clean example: Python bytecode, GIL held throughout, identical timings on stdlib and ferro_io.
  4. IO workloads are wall-clock dominated. A 50 ms network RTT is 50 ms under any runtime. ferro_io, uvloop, and stdlib all sit at the sleep floor because they cannot beat physics.
  5. map_blocking's spawn_blocking trick only wins when the closure contains pure Rust code. The moment you call back into Python inside spawn_blocking, the GIL re-acquires and the gain vanishes. ferro_io exposes no generic mechanism for users to write that pure-Rust closure.
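
Point 2 (and its contrast with point 3) can be demonstrated without ferro_io at all: under stdlib's ThreadPoolExecutor, a C call that releases the GIL scales with thread count while pure Python bytecode does not. hashlib stands in for bcrypt/Pillow here because its documentation guarantees the GIL is released for inputs larger than 2047 bytes.

import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

DATA = b"x" * (64 * 1024 * 1024)  # 64 MiB buffer, well past hashlib's GIL-release threshold

def c_work(_: int) -> str:
    return hashlib.sha256(DATA).hexdigest()  # GIL released inside the C/OpenSSL code

def py_work(_: int) -> int:
    total = 0
    for i in range(2_000_000):  # pure bytecode, GIL held throughout
        total += i * i
    return total

def timed(fn, workers: int, jobs: int = 8) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fn, range(jobs)))
    return time.perf_counter() - start

for fn in (c_work, py_work):
    # Expect c_work to speed up with more threads and py_work to stay flat.
    print(f"{fn.__name__}: 1 thread {timed(fn, 1):.2f}s, 8 threads {timed(fn, 8):.2f}s")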

The architectural bet (Tokio multi-thread runtime scheduling Python) would become meaningful on PEP 703 free-threaded Python (python3.13t, python3.14t), where the GIL is genuinely removable. ferro_io was not audited or tested against free-threaded builds during its active period. Anyone forking this work should start there.
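
For anyone picking that fork up, a quick way to confirm the interpreter is actually a free-threaded build (sys._is_gil_enabled() exists on CPython 3.13+):

import sys
import sysconfig

# Py_GIL_DISABLED is set only in free-threaded (python3.13t / python3.14t) builds.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
# Even on a free-threaded build the GIL can be re-enabled at runtime.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
print(f"free-threaded build: {free_threaded_build}, GIL currently enabled: {gil_enabled}")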

What was real

  • PyO3 0.28 + pyo3-async-runtimes 0.28 bridging is correct and well-covered: 307 unit tests + 6 heavyweight integration tests (asyncpg, SQLAlchemy async, FastAPI / Starlette) pass against real services.
  • Symbol-level drop-in compatibility is real: 119/119 public asyncio symbols resolve through ferro_io via the __getattr__ fallthrough.
  • FastTaskGroup (a Python-side optimization using eager_task_factory plus an inlined fast path for eagerly completed tasks) runs 3.26× faster than stdlib and 1.60× faster than uvloop on synthetic 10k-spawn benches; see the sketch after this list. It is useful only at that scale and invisible at typical application task counts.
  • The code is a reasonable reference for PyO3 + Tokio + asyncio bridging if you are learning the stack.
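
For context, the stdlib mechanism that bullet names, eager_task_factory, can be enabled directly on Python 3.12+; FastTaskGroup's inlined fast path is not reproduced here, only the eager-execution baseline it builds on.

import asyncio

async def noop() -> int:
    return 1  # completes without suspending, so eager execution skips a scheduling round-trip

async def main() -> None:
    loop = asyncio.get_running_loop()
    loop.set_task_factory(asyncio.eager_task_factory)  # Python 3.12+
    async with asyncio.TaskGroup() as tg:
        for _ in range(10_000):
            tg.create_task(noop())

asyncio.run(main())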

What was not real

  • "947× faster" as a general claim about the runtime.
  • "Multi-core async runtime for Python" — the runtime is multi-core; the Python code it runs is not, while the GIL remains.
  • "Drop-in asyncio replacement that's faster" — drop-in replacement, yes. Faster for real workloads, no.

Original documentation (archived)

The sections below reflect the pre-retirement README. They are preserved so the original positioning and benchmark methodology remain reviewable.

Two modes

# Mode 1 — library-level
import ferro_io as asyncio

async def main():
    await asyncio.sleep(0)   # placeholder workload so the snippet runs as-is

asyncio.run(main())

# Mode 2 — process-wide (third-party libs also benefit)
import ferro_io
ferro_io.install()
import aiofiles            # now uses ferro_io under the hood

Original benchmarks (M-series Mac, 14 cores)

Workload | stdlib asyncio | uvloop | ferro_io | vs stdlib
---|---|---|---|---
50 × 50ms IO sleep | 51.62 ms | 51.57 ms | 51.44 ms | ~1× (sleep floor)
200 × 20ms IO sleep | 22.94 ms | 22.00 ms | 21.23 ms | ~1× (sleep floor)
14 CPU chains × 5M LCG iters | 6036 ms | 5578 ms | 6.37 ms | 947×

Read the last row in light of the Findings section above: the "CPU chain" workload is Rust arithmetic on one side vs Python arithmetic on the other, not a runtime comparison.

Coverage (as tested during active development)

  • Symbol-level: 119/119 asyncio public symbols resolve through ferro_io.
  • Real-world programs verified: TaskGroup, subprocess pipelines, TCP client/server with streams, to_thread, Queue producer/consumer, gather with return_exceptions, wait_for, timeout, Runner with contextvars.
  • Third-party library smoke test: aiofiles works under ferro_io.install().
  • Heavyweight libraries verified against real services under ferro_io.install(): asyncpg (Cython records, prepared statements, transactions, pools), SQLAlchemy async (greenlet sync→async bridge over asyncpg), FastAPI / Starlette (anyio, contextvars middleware, to_thread). See tests/test_heavyweights.py.
  • 307 unit tests + 6 heavyweight integration tests, 0 skipped, 0 failed.

Build (reference only)

python -m venv .venv && source .venv/bin/activate
pip install maturin pytest pytest-asyncio
maturin develop --release
pytest tests/ -v
python benchmarks/bench.py

Built with PyO3 0.28 + pyo3-async-runtimes 0.28 + Tokio + maturin. ABI3 wheels covered Python 3.9+.


License

MIT. See LICENSE (if present) or the repository root.
