perf: overhaul CVE ingestion with parallel parsing and topological bulk writes#970
Open
DarshanCode2005 wants to merge 4 commits into NixOS:main from
The previous ingestion pattern saved and linked objects one by one, forcing thousands of database round-trips (one for every model row and M2M row) and killing throughput on real-world datasets. This commit introduces a buffering container that collects unsaved Django models and their M2M relationship tuples in memory. By accumulating objects and flushing them in a single `transaction.atomic()` block, we can use `bulk_create` to write data in topological order (CVEs -> Descriptions, etc.), effectively consolidating thousands of individual `INSERT` statements into a handful of efficient batch operations.
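As an illustration of the buffering pattern described above (a framework-agnostic sketch, not the actual implementation — the class name, model names, and flush order here are hypothetical):

```python
from collections import defaultdict

# Hypothetical parent-first flush order; the real code orders Django
# models so foreign-key targets are inserted before their dependents.
FLUSH_ORDER = ["CveRecord", "Description", "Reference"]

class BulkContext:
    """Memory-resident buffer for unsaved objects and M2M link tuples."""

    def __init__(self):
        self.buffers = defaultdict(list)  # model name -> unsaved instances
        self.m2m_links = []               # (through_table, left, right)

    def add(self, model_name, obj):
        self.buffers[model_name].append(obj)

    def add_m2m(self, through, left, right):
        self.m2m_links.append((through, left, right))

    def flush(self, bulk_writer):
        """Write all buffers in topological order, one call per model.

        In the real code this happens inside transaction.atomic() and the
        writer is Model.objects.bulk_create; M2M rows are written last,
        once parent primary keys exist.
        """
        written = []
        for model_name in FLUSH_ORDER:
            batch = self.buffers.pop(model_name, [])
            if batch:
                bulk_writer(model_name, batch)  # one bulk INSERT per model
                written.append(model_name)
        return written
```

The key point is that `flush()` turns N per-object saves into one batch write per model, in dependency order.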
JSON parsing and Django object instantiation are CPU-bound tasks that bottleneck the primary ingestion thread. With datasets exceeding tens of thousands of files, single-threaded processing fails to saturate modern multi-core systems. This commit overhauls ingest_bulk_cve.py to use `ProcessPoolExecutor`. We now chunk the file list and map it across the available CPU cores. This architecture allows us to overlap the heavy parsing of one batch with the database I/O of the previous one, achieving a sustained throughput of ~600+ CVEs/sec.
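A minimal sketch of the chunk-and-map pattern (function and parameter names are illustrative, not the PR's actual API; the executor class is made injectable, which also mirrors how the tests swap in `ThreadPoolExecutor`):

```python
from concurrent.futures import ProcessPoolExecutor

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def parse_chunk(paths):
    # Stand-in for the real worker, which parses JSON files into
    # unsaved model payloads.
    return [f"parsed:{p}" for p in paths]

def parse_all(paths, chunk_size=2, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Fan the file list out across workers, preserving input order."""
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        # map() yields chunk results in submission order while workers
        # keep parsing ahead, so DB writes can overlap the next parse.
        for payload in pool.map(parse_chunk, chunked(paths, chunk_size)):
            results.extend(payload)
    return results
```

Because `parse_chunk` is a module-level function, its arguments and results pickle cleanly across process boundaries.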
Ensures that the daily delta ingestion process benefits from the same high-performance indexing strategy as the bulk command by integrating update_search_vectors(). Additionally, this commit provides: 1. A new test suite verifying the M2M topological logic. 2. Updated tests that correctly mock the new parallel architecture and database contexts. 3. Minor formatting cleanups.
Resolves #16
Description
This PR optimizes the CVE ingestion pipeline (both bulk and delta), achieving a 3.5x–5x performance increase on localhost. The previous implementation was bottlenecked by single-core JSON parsing and synchronous, row-level database interactions, specifically N+1 `save()` calls and expensive full-text search triggers. By introducing a dedicated bulk-write context and parallelizing the CPU-bound parsing stage, throughput has scaled from ~130–200 CVEs/sec to ~690+ CVEs/sec locally.
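A quick sanity check that the quoted throughput figures are consistent with the claimed 3.5x–5x range:

```python
# Figures taken from the description above (localhost measurements).
baseline_low, baseline_high = 130, 200   # CVEs/sec before this PR
optimized = 690                          # CVEs/sec after this PR

speedup_best_case = optimized / baseline_low    # vs. the slowest baseline
speedup_worst_case = optimized / baseline_high  # vs. the fastest baseline
# ~5.3x and ~3.45x respectively, matching the 3.5x-5x claim.
```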
Thought Process and Architectural Rationale
1. Solving the N+1 Ingestion Bottleneck
The core issue was that each CVE record, along with its associated `Description`, `AffectedProduct`, `Version`, and `Reference` objects, was being saved one by one. In a database with triggers and constraints, this overhead accumulates per row.

Solution: `CveBulkContext` was implemented as a memory-resident buffer that collects model instances and M2M link tuples. Instead of saving immediately, it accumulates a batch of CVEs and then flushes them to PostgreSQL using `bulk_create` inside a single transaction.

2. Topological Dependency Ordering
Bulk insertion requires strict ordering due to foreign key constraints. The flush order is:
`Tag`, `Platform`, `Cpe`, `Organization`, `Description`, `Metric`, `Reference`, `Event`, `CveRecord`, `Container`, `AffectedProduct`

3. Suppressing PostgreSQL Triggers
The `pgtrigger` search vector updates on `Description`, `Container`, and `AffectedProduct` were the single biggest CPU bottleneck on the DB side. By wrapping the bulk flush in `pgtrigger.ignore()`, row-level indexing is bypassed during ingestion. A single `update_search_vectors()` call at the end of ingestion updates all indices in aggregate, which is significantly faster than per-row trigger execution.

4. Parallel CPU-Bound Parsing
JSON parsing and object instantiation are CPU-bound tasks. On a single thread, available cores were left idle.
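As a rough sketch of the overlap (names are hypothetical; shown here with `ThreadPoolExecutor` so it runs anywhere, whereas the command itself uses `ProcessPoolExecutor` for true CPU parallelism):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_batch(batch_id):
    # Stand-in for the CPU-heavy JSON parsing done in a worker.
    return [f"cve-{batch_id}-{i}" for i in range(3)]

def ingest(batch_ids, max_workers=2):
    """Workers parse upcoming batches while the main thread 'writes'
    each finished batch, so parsing and DB I/O overlap."""
    written = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(parse_batch, b) for b in batch_ids]
        for future in as_completed(futures):
            # Serial "database write" in the main thread; remaining
            # workers keep parsing while this runs.
            written.extend(future.result())
    return written
```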
`ProcessPoolExecutor` is used in the management command to chunk the file list across worker processes that return populated `CveBulkContext` payloads. The main process handles serial database writes, effectively overlapping the CPU-heavy parsing stage with the I/O-heavy DB write stage.

Changes
Core logic (`src/shared/bulk_ingestion.py`):
- `CveBulkContext` dataclass for memory buffering
- `flush()` for atomic, ordered bulk writes
- `update_search_vectors()` using aggregate SQL updates
- `prepare_organization` to defer DB lookups, enabling worker process isolation
- `Sequence` and `Mapping` to handle model covariance

Management commands (`src/shared/management/commands/`):
- `ingest_bulk_cve.py`: Rewritten to use `ProcessPoolExecutor` with a chunked worker pattern
- `ingest_delta_cve.py`: Updated to use `CveBulkContext` and deferred search vector updates, ensuring consistency between full and incremental syncs

Testing (`src/shared/tests/`):
- `test_bulk_ingestion.py` [NEW]: Comprehensive suite verifying topological sort order and M2M link consistency
- `test_ingest_bulk_cve.py`: Updated to support a mocked `ProcessPoolExecutor` (via `ThreadPoolExecutor` for test stability) and patched database contexts

Performance
Final ingestion log (full dataset, 2019–2025, localhost):
```
INFO 2026-04-01 14:21:04,711 ingest_bulk_cve 45299 Fetched latest release: CVE 2026-04-01_1300Z
INFO 2026-04-01 14:21:06,838 ingest_bulk_cve 45299 Flushing batch of 2293 CVEs...
INFO 2026-04-01 14:21:12,664 ingest_bulk_cve 45299 Flushing batch of 4589 CVEs...
INFO 2026-04-01 14:21:22,778 ingest_bulk_cve 45299 Flushing batch of 4999 CVEs...
INFO 2026-04-01 14:21:35,803 ingest_bulk_cve 45299 Flushing batch of 4997 CVEs...
INFO 2026-04-01 14:21:51,275 ingest_bulk_cve 45299 Flushing batch of 4997 CVEs...
...
INFO 2026-04-01 14:25:38,386 ingest_bulk_cve 45299 189321 CVEs ingested.
INFO 2026-04-01 14:25:38,386 ingest_bulk_cve 45299 Updating search vectors in bulk...
INFO 2026-04-01 14:25:38,555 ingest_bulk_cve 45299 Search vectors updated successfully.
INFO 2026-04-01 14:25:38,555 ingest_bulk_cve 45299 Saving the ingestion valid up to 2026-04-01
```

Total: 189,321 CVEs in 274 seconds (~691 CVE/s)