
perf: overhaul CVE ingestion with parallel parsing and topological bulk writes#970

Open
DarshanCode2005 wants to merge 4 commits into NixOS:main from DarshanCode2005:cve-ingestion-performance

Conversation

Contributor

@DarshanCode2005 DarshanCode2005 commented Apr 1, 2026

Resolves #16

Description

This PR optimizes the CVE ingestion pipeline (both bulk and delta), achieving a 3.5x–5x performance increase on localhost. The previous implementation was bottlenecked by single-core JSON parsing and synchronous, row-level database interactions, specifically N+1 save() calls and expensive full-text search triggers.

By introducing a dedicated bulk-write context and parallelizing the CPU-bound parsing stage, throughput has scaled from ~130–200 CVEs/sec to ~690+ CVEs/sec locally.


Thought Process and Architectural Rationale

1. Solving the N+1 Ingestion Bottleneck

The core issue was that each CVE record, along with its associated Description, AffectedProduct, Version, and Reference objects, was being saved one-by-one. In a database with triggers and constraints, this overhead accumulates per row.

Solution: CveBulkContext was implemented as a memory-resident buffer that collects model instances and M2M link tuples. Instead of immediate saving, it accumulates a batch of CVEs and then flushes them to PostgreSQL using bulk_create inside a single transaction.
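The buffering pattern can be sketched as follows. This is a minimal, framework-free illustration of the idea, not the PR's actual CveBulkContext: the class name, method names, and the `writer` callback are placeholders standing in for the real Django models, bulk_create calls, and transaction handling.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BulkContext:
    """Memory-resident buffer: collects unsaved rows per model name
    plus M2M link tuples, then flushes them in one batch."""
    rows: dict = field(default_factory=lambda: defaultdict(list))
    m2m_links: list = field(default_factory=list)

    def add(self, model_name, row):
        self.rows[model_name].append(row)

    def link(self, left, right):
        self.m2m_links.append((left, right))

    def pending(self):
        return sum(len(batch) for batch in self.rows.values())

    def flush(self, writer):
        # In the real code this would be Model.objects.bulk_create(...)
        # inside a single transaction.atomic() block.
        for model_name, batch in self.rows.items():
            writer(model_name, batch)
        self.rows.clear()
        self.m2m_links.clear()
```

The key property is that no database work happens in `add()` or `link()`; all writes are deferred until `flush()`, which turns thousands of row-level saves into a handful of batch operations.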

2. Topological Dependency Ordering

Bulk insertion requires strict ordering due to foreign key constraints. The flush order is:

  • Leaf nodes (deduplicated): Tag, Platform, Cpe, Organization
  • Base entities: Description, Metric, Reference, Event
  • Parent entities: CveRecord
  • Children entities: Container, AffectedProduct
  • M2M links: all relationship mappings
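The ordering above can be expressed as a simple phase table. This is a hypothetical sketch of the idea; the actual model set and ordering live in the PR's bulk_ingestion.py:

```python
# Hypothetical phase grouping mirroring the FK dependency graph;
# each phase only references rows created in earlier phases.
FLUSH_PHASES = [
    ["Tag", "Platform", "Cpe", "Organization"],       # deduplicated leaves
    ["Description", "Metric", "Reference", "Event"],  # base entities
    ["CveRecord"],                                    # parent entities
    ["Container", "AffectedProduct"],                 # children (FK -> parents)
]


def flush_order(buffered):
    """Yield buffered model names in FK-safe order; M2M links go last."""
    for phase in FLUSH_PHASES:
        for name in phase:
            if name in buffered:
                yield name
```

Because every phase depends only on rows written by earlier phases, each `bulk_create` sees all of its foreign key targets already persisted.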

3. Suppressing PostgreSQL Triggers

The pgtrigger search vector updates on Description, Container, and AffectedProduct were the single biggest CPU bottleneck on the DB side. By wrapping the bulk flush in pgtrigger.ignore(), row-level indexing is bypassed during ingestion. A single update_search_vectors() call at the end of ingestion updates all indices in aggregate, which is significantly faster than per-row trigger execution.
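The trade-off here — suppress per-row work during the batch, then do one aggregate pass — can be illustrated with a toy stand-in (no Postgres or pgtrigger involved; the class and counters below are invented for illustration only):

```python
from contextlib import contextmanager


class SearchIndex:
    """Toy model of a triggered index: per_row_updates counts trigger
    firings, bulk_updates counts aggregate rebuilds."""

    def __init__(self):
        self.enabled = True
        self.per_row_updates = 0
        self.bulk_updates = 0

    @contextmanager
    def ignore_triggers(self):
        # Analogous to pgtrigger.ignore(): suppress per-row trigger
        # work for the duration of the block, then restore it.
        self.enabled = False
        try:
            yield
        finally:
            self.enabled = True

    def insert_row(self):
        if self.enabled:
            self.per_row_updates += 1  # expensive tsvector update per row

    def update_search_vectors(self):
        self.bulk_updates += 1  # one set-based UPDATE covering all rows


idx = SearchIndex()
with idx.ignore_triggers():
    for _ in range(1000):
        idx.insert_row()
idx.update_search_vectors()
```

1000 inserts produce zero per-row trigger firings and a single aggregate rebuild, which is the same shape as wrapping the flush in pgtrigger.ignore() and calling update_search_vectors() once at the end.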

4. Parallel CPU-Bound Parsing

JSON parsing and object instantiation are CPU-bound tasks. On a single thread, available cores were left idle. ProcessPoolExecutor is used in the management command to chunk the file list across worker processes that return populated CveBulkContext payloads. The main process handles serial database writes, effectively overlapping the CPU-heavy parsing stage with the IO-heavy DB write stage.
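The chunked worker pattern looks roughly like this. It is a simplified sketch: `parse_chunk` stands in for the real JSON parsing and model instantiation, and the flush step is reduced to collecting results in the main process.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def parse_chunk(paths):
    # Stand-in for CPU-bound JSON parsing + object instantiation
    # performed inside a worker process.
    return [p.upper() for p in paths]


def ingest(paths, workers=4, chunk_size=2):
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Workers parse chunks in parallel; the main process consumes
        # payloads serially, so DB writes for one batch overlap with
        # parsing of the next.
        for payload in pool.map(parse_chunk, chunked(paths, chunk_size)):
            results.extend(payload)  # real code: ctx.flush() here
    return results


if __name__ == "__main__":
    ingest(["a.json", "b.json", "c.json"])
```

Keeping database writes in the main process avoids sharing connections across processes, while `pool.map` preserves chunk order, so the serial flush sequence is deterministic.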


Changes

Core logic (src/shared/bulk_ingestion.py):

  • Implemented CveBulkContext dataclass for memory buffering
  • Added flush() for atomic, ordered bulk writes
  • Implemented update_search_vectors() using aggregate SQL updates
  • Refactored prepare_organization to defer DB lookups, enabling worker process isolation
  • Strict type hinting with Sequence and Mapping to handle model covariance
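Why Sequence matters for model covariance can be shown in a few lines. This is a generic typing illustration with invented class names, not the PR's actual signatures:

```python
from typing import Sequence


class Model: ...
class CveRecord(Model): ...


def bulk_create(objs: Sequence[Model]) -> int:
    # Sequence is covariant and read-only, so a list[CveRecord] passed
    # here type-checks. A parameter typed list[Model] would be rejected
    # by a checker, because list is invariant: the callee could append
    # a non-CveRecord into the caller's list.
    return len(objs)


records = [CveRecord(), CveRecord()]
count = bulk_create(records)
```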

Management commands (src/shared/management/commands/):

  • ingest_bulk_cve.py: Rewritten to use ProcessPoolExecutor with a chunked worker pattern
  • ingest_delta_cve.py: Updated to use CveBulkContext and deferred search vector updates, ensuring consistency between full and incremental syncs

Testing (src/shared/tests/):

  • test_bulk_ingestion.py [NEW]: Comprehensive suite verifying topological sort order and M2M link consistency
  • test_ingest_bulk_cve.py: Updated to support mocked ProcessPoolExecutor (via ThreadPoolExecutor for test stability) and patched database contexts

Performance

Path                        | Throughput     | Notes
Baseline                    | ~130–200 CVE/s | Single-threaded, synchronous triggers, row-level saves
This PR (localhost)         | ~691 CVE/s     | Parallel parsing, bulk writes, suppressed triggers
This PR (staged, projected) | ~1000+ CVE/s   | DB network latency makes bulk batching significantly more impactful

Final ingestion log (full dataset, 2019–2025, localhost):

INFO 2026-04-01 14:21:04,711 ingest_bulk_cve 45299 Fetched latest release: CVE 2026-04-01_1300Z
INFO 2026-04-01 14:21:06,838 ingest_bulk_cve 45299 Flushing batch of 2293 CVEs...
INFO 2026-04-01 14:21:12,664 ingest_bulk_cve 45299 Flushing batch of 4589 CVEs...
INFO 2026-04-01 14:21:22,778 ingest_bulk_cve 45299 Flushing batch of 4999 CVEs...
INFO 2026-04-01 14:21:35,803 ingest_bulk_cve 45299 Flushing batch of 4997 CVEs...
INFO 2026-04-01 14:21:51,275 ingest_bulk_cve 45299 Flushing batch of 4997 CVEs...
...
INFO 2026-04-01 14:25:38,386 ingest_bulk_cve 45299 189321 CVEs ingested.
INFO 2026-04-01 14:25:38,386 ingest_bulk_cve 45299 Updating search vectors in bulk...
INFO 2026-04-01 14:25:38,555 ingest_bulk_cve 45299 Search vectors updated successfully.
INFO 2026-04-01 14:25:38,555 ingest_bulk_cve 45299 Saving the ingestion valid up to 2026-04-01

Total: 189,321 CVEs in 274 seconds (~691 CVE/s)

The previous ingestion pattern saved and linked objects one-by-one, forcing thousands of database round-trips, one for every model and M2M row, killing throughput on real-world datasets.

This commit introduces a buffering container that collects unsaved Django models and their M2M relationship tuples in memory. By accumulating objects and flushing them in a single transaction.atomic() block, we can use bulk_create to write data in topological order (CVEs -> Descriptions, etc.), effectively consolidating thousands of individual INSERT calls into a handful of efficient batch operations.
JSON parsing and Django object instantiation are CPU-bound tasks that bottleneck the primary ingestion thread. With datasets exceeding tens of thousands of files, single-threaded processing fails to saturate modern multi-core systems.

This commit overhauls ingest_bulk_cve.py to use ProcessPoolExecutor. We now chunk the file list and map it across available CPU cores. This architecture allows us to overlap the heavy parsing of one batch with the database I/O of the previous one, achieving a sustained throughput of ~600+ CVEs/sec.
Ensures that the daily delta ingestion process benefits from the same high-performance indexing strategy as the bulk command by integrating update_search_vectors().
Additionally, this commit provides:
1. A new test suite () verifying M2M topological logic.
2. Updated tests that correctly mock the new parallel architecture and database contexts.
3. Some final formatting cleanups.


Development

Successfully merging this pull request may close these issues.

CVE fetchers should work using a bulk saving context or perform bulk_create
