
perf: overhaul CVE ingestion with parallel parsing and topological bulk writes#970

Open
DarshanCode2005 wants to merge 4 commits into NixOS:main from DarshanCode2005:cve-ingestion-performance

Conversation

Contributor

@DarshanCode2005 DarshanCode2005 commented Apr 1, 2026

Resolves #16

Description

This PR optimizes the CVE ingestion pipeline (both bulk and delta), achieving a 3.5x–5x performance increase on localhost. The previous implementation was bottlenecked by single-core JSON parsing and synchronous, row-level database interactions, specifically N+1 save() calls and expensive full-text search triggers.

By introducing a dedicated bulk-write context and parallelizing the CPU-bound parsing stage, throughput has scaled from ~130–200 CVEs/sec to ~690+ CVEs/sec locally.


Thought Process and Architectural Rationale

1. Solving the N+1 Ingestion Bottleneck

The core issue was that each CVE record, along with its associated Description, AffectedProduct, Version, and Reference objects, was being saved one-by-one. In a database with triggers and constraints, this overhead accumulates per row.

Solution: CveBulkContext was implemented as a memory-resident buffer that collects model instances and M2M link tuples. Instead of immediate saving, it accumulates a batch of CVEs and then flushes them to PostgreSQL using bulk_create inside a single transaction.
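The buffering pattern can be sketched as follows. This is a minimal, framework-free illustration of the idea, not the PR's actual CveBulkContext: the class name, method names, and the `writer` callback are placeholders standing in for the real Django models, bulk_create calls, and transaction handling.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BulkContext:
    """Memory-resident buffer: collects unsaved rows per model name
    plus M2M link tuples, then flushes them in one batch."""
    rows: dict = field(default_factory=lambda: defaultdict(list))
    m2m_links: list = field(default_factory=list)

    def add(self, model_name, row):
        self.rows[model_name].append(row)

    def link(self, left, right):
        self.m2m_links.append((left, right))

    def pending(self):
        return sum(len(batch) for batch in self.rows.values())

    def flush(self, writer):
        # In the real code this would be Model.objects.bulk_create(...)
        # inside a single transaction.atomic() block.
        for model_name, batch in self.rows.items():
            writer(model_name, batch)
        self.rows.clear()
        self.m2m_links.clear()
```

The key property is that no database work happens in `add()` or `link()`; all writes are deferred until `flush()`, which turns thousands of row-level saves into a handful of batch operations.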

2. Topological Dependency Ordering

Bulk insertion requires strict ordering due to foreign key constraints. The flush order is:

  • Leaf nodes (deduplicated): Tag, Platform, Cpe, Organization
  • Base entities: Description, Metric, Reference, Event
  • Parent entities: CveRecord
  • Children entities: Container, AffectedProduct
  • M2M links: all relationship mappings
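The ordering above can be expressed as a simple phase table. This is a hypothetical sketch of the idea; the actual model set and ordering live in the PR's bulk_ingestion.py:

```python
# Hypothetical phase grouping mirroring the FK dependency graph;
# each phase only references rows created in earlier phases.
FLUSH_PHASES = [
    ["Tag", "Platform", "Cpe", "Organization"],       # deduplicated leaves
    ["Description", "Metric", "Reference", "Event"],  # base entities
    ["CveRecord"],                                    # parent entities
    ["Container", "AffectedProduct"],                 # children (FK -> parents)
]


def flush_order(buffered):
    """Yield buffered model names in FK-safe order; M2M links go last."""
    for phase in FLUSH_PHASES:
        for name in phase:
            if name in buffered:
                yield name
```

Because every phase depends only on rows written by earlier phases, each `bulk_create` sees all of its foreign key targets already persisted.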

3. Suppressing PostgreSQL Triggers

The pgtrigger search vector updates on Description, Container, and AffectedProduct were the single biggest CPU bottleneck on the DB side. By wrapping the bulk flush in pgtrigger.ignore(), row-level indexing is bypassed during ingestion. A single update_search_vectors() call at the end of ingestion updates all indices in aggregate, which is significantly faster than per-row trigger execution.
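The trade-off here — suppress per-row work during the batch, then do one aggregate pass — can be illustrated with a toy stand-in (no Postgres or pgtrigger involved; the class and counters below are invented for illustration only):

```python
from contextlib import contextmanager


class SearchIndex:
    """Toy model of a triggered index: per_row_updates counts trigger
    firings, bulk_updates counts aggregate rebuilds."""

    def __init__(self):
        self.enabled = True
        self.per_row_updates = 0
        self.bulk_updates = 0

    @contextmanager
    def ignore_triggers(self):
        # Analogous to pgtrigger.ignore(): suppress per-row trigger
        # work for the duration of the block, then restore it.
        self.enabled = False
        try:
            yield
        finally:
            self.enabled = True

    def insert_row(self):
        if self.enabled:
            self.per_row_updates += 1  # expensive tsvector update per row

    def update_search_vectors(self):
        self.bulk_updates += 1  # one set-based UPDATE covering all rows


idx = SearchIndex()
with idx.ignore_triggers():
    for _ in range(1000):
        idx.insert_row()
idx.update_search_vectors()
```

1000 inserts produce zero per-row trigger firings and a single aggregate rebuild, which is the same shape as wrapping the flush in pgtrigger.ignore() and calling update_search_vectors() once at the end.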

4. Parallel CPU-Bound Parsing

JSON parsing and object instantiation are CPU-bound tasks. On a single thread, available cores were left idle. ProcessPoolExecutor is used in the management command to chunk the file list across worker processes that return populated CveBulkContext payloads. The main process handles serial database writes, effectively overlapping the CPU-heavy parsing stage with the IO-heavy DB write stage.
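The chunked worker pattern looks roughly like this. It is a simplified sketch: `parse_chunk` stands in for the real JSON parsing and model instantiation, and the flush step is reduced to collecting results in the main process.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def parse_chunk(paths):
    # Stand-in for CPU-bound JSON parsing + object instantiation
    # performed inside a worker process.
    return [p.upper() for p in paths]


def ingest(paths, workers=4, chunk_size=2):
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Workers parse chunks in parallel; the main process consumes
        # payloads serially, so DB writes for one batch overlap with
        # parsing of the next.
        for payload in pool.map(parse_chunk, chunked(paths, chunk_size)):
            results.extend(payload)  # real code: ctx.flush() here
    return results


if __name__ == "__main__":
    ingest(["a.json", "b.json", "c.json"])
```

Keeping database writes in the main process avoids sharing connections across processes, while `pool.map` preserves chunk order, so the serial flush sequence is deterministic.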


Changes

Core logic (src/shared/bulk_ingestion.py):

  • Implemented CveBulkContext dataclass for memory buffering
  • Added flush() for atomic, ordered bulk writes
  • Implemented update_search_vectors() using aggregate SQL updates
  • Refactored prepare_organization to defer DB lookups, enabling worker process isolation
  • Strict type hinting with Sequence and Mapping to handle model covariance
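Why Sequence matters for model covariance can be shown in a few lines. This is a generic typing illustration with invented class names, not the PR's actual signatures:

```python
from typing import Sequence


class Model: ...
class CveRecord(Model): ...


def bulk_create(objs: Sequence[Model]) -> int:
    # Sequence is covariant and read-only, so a list[CveRecord] passed
    # here type-checks. A parameter typed list[Model] would be rejected
    # by a checker, because list is invariant: the callee could append
    # a non-CveRecord into the caller's list.
    return len(objs)


records = [CveRecord(), CveRecord()]
count = bulk_create(records)
```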

Management commands (src/shared/management/commands/):

  • ingest_bulk_cve.py: Rewritten to use ProcessPoolExecutor with a chunked worker pattern
  • ingest_delta_cve.py: Updated to use CveBulkContext and deferred search vector updates, ensuring consistency between full and incremental syncs

Testing (src/shared/tests/):

  • test_bulk_ingestion.py [NEW]: Comprehensive suite verifying topological sort order and M2M link consistency
  • test_ingest_bulk_cve.py: Updated to support mocked ProcessPoolExecutor (via ThreadPoolExecutor for test stability) and patched database contexts

Performance

Path                        | Throughput     | Notes
Baseline                    | ~130–200 CVE/s | Single-threaded, synchronous triggers, row-level saves
This PR (localhost)         | ~691 CVE/s     | Parallel parsing, bulk writes, suppressed triggers
This PR (staged, projected) | ~1000+ CVE/s   | DB network latency makes bulk batching significantly more impactful

Final ingestion log (full dataset, 2019–2025, localhost):

INFO 2026-04-01 14:21:04,711 ingest_bulk_cve 45299 Fetched latest release: CVE 2026-04-01_1300Z
INFO 2026-04-01 14:21:06,838 ingest_bulk_cve 45299 Flushing batch of 2293 CVEs...
INFO 2026-04-01 14:21:12,664 ingest_bulk_cve 45299 Flushing batch of 4589 CVEs...
INFO 2026-04-01 14:21:22,778 ingest_bulk_cve 45299 Flushing batch of 4999 CVEs...
INFO 2026-04-01 14:21:35,803 ingest_bulk_cve 45299 Flushing batch of 4997 CVEs...
INFO 2026-04-01 14:21:51,275 ingest_bulk_cve 45299 Flushing batch of 4997 CVEs...
...
INFO 2026-04-01 14:25:38,386 ingest_bulk_cve 45299 189321 CVEs ingested.
INFO 2026-04-01 14:25:38,386 ingest_bulk_cve 45299 Updating search vectors in bulk...
INFO 2026-04-01 14:25:38,555 ingest_bulk_cve 45299 Search vectors updated successfully.
INFO 2026-04-01 14:25:38,555 ingest_bulk_cve 45299 Saving the ingestion valid up to 2026-04-01

Total: 189,321 CVEs in 274 seconds (~691 CVE/s)

The previous ingestion pattern saved and linked objects one-by-one, forcing thousands of database round-trips, one for every model and M2M row, killing throughput on real-world datasets.

This commit introduces a buffering container that collects unsaved Django models and their M2M relationship tuples in memory. By accumulating objects and flushing them in a single transaction.atomic() block, we can use bulk_create to write data in topological order (CVEs -> Descriptions, etc.), effectively consolidating thousands of individual INSERT calls into a handful of efficient batch operations.
JSON parsing and Django object instantiation are CPU-bound tasks that bottleneck the primary ingestion thread. With datasets exceeding tens of thousands of files, single-threaded processing fails to saturate modern multi-core systems.

This commit overhauls ingest_bulk_cve.py to use ProcessPoolExecutor. We now chunk the file list and map it across available CPU cores. This architecture allows us to overlap the heavy parsing of one batch with the database I/O of the previous one, achieving a sustained throughput of ~600+ CVEs/sec.
Ensures that the daily delta ingestion process benefits from the same high-performance indexing strategy as the bulk command by integrating update_search_vectors().
Additionally, this commit provides:
1. A new test suite () verifying M2M topological logic.
2. Updated tests that correctly mock the new parallel architecture and database contexts.
3. Some final formatting cleanups.


Development

Successfully merging this pull request may close these issues.

CVE fetchers should work using a bulk saving context or perform bulk_create
