refactor(datasource/rpm): extract xml metadata provider #41910

Open
kpumuk wants to merge 6 commits into renovatebot:main from kpumuk:rpm-xml-provider

Conversation

Contributor

@kpumuk kpumuk commented Mar 13, 2026

Changes

This pull request was split from #41866 and contains only the basic refactoring of the RPM datasource: extracting the primary (XML) package metadata format into its own provider.

Previously, the RPM datasource mixed repository metadata discovery and XML package parsing inside a single class. That made the code harder to extend for additional metadata backends and made the later SQLite work more invasive than it needed to be.

Note

This refactor intentionally keeps one small behavior improvement that was already folded into the extraction: malformed RPM XML entries whose `<version>` element is missing the `ver` attribute are now ignored instead of producing a release with `version: undefined`.

That is a behavior change for invalid metadata, but it is a safer outcome and aligns better with the intended RPM parsing behavior.
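To illustrate the guard (a minimal sketch with hypothetical names; the actual provider code differs), the change amounts to skipping any `<version>` element whose attributes lack `ver` instead of emitting an undefined version:

```typescript
interface VersionAttrs {
  ver?: string;
  rel?: string;
}

// Collect release versions from parsed <version> attributes, dropping
// malformed entries that are missing the mandatory `ver` attribute.
// Previously such entries produced a release with `version: undefined`.
function collectVersions(entries: VersionAttrs[]): string[] {
  const versions: string[] = [];
  for (const attrs of entries) {
    if (!attrs.ver) {
      continue; // malformed metadata: ignore instead of emitting undefined
    }
    // RPM convention: "version-release" when a release tag is present.
    versions.push(attrs.rel ? `${attrs.ver}-${attrs.rel}` : attrs.ver);
  }
  return versions;
}
```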

In the #41910 (comment) discussion, an additional improvement was proposed: unpacking the XML to a file on disk, which led to significant memory savings. In a benchmark over 50 Amazon Linux 2023 dependencies, the results are:

| Version | Avg Wall Time (3 runs) | Avg Peak RSS (3 runs) |
|---|---|---|
| Before (upstream/main) | 416.55s | 1578.2 MiB |
| After (rpm-xml-provider) | 416.93s (+0.09%) | 533.8 MiB (-66.2%) |

Context

Please select one of the following:

  • This closes an existing Issue, Closes: #
  • This doesn't close an Issue, but I accept the risk that this PR may be closed if maintainers disagree with its opening or implementation

AI assistance disclosure

Did you use AI tools to create any part of this pull request?

Please select one option and, if yes, briefly describe how AI was used (e.g., code, tests, docs) and which tool(s) you used.

  • No — I did not use AI for this contribution.
  • Yes — minimal assistance (e.g., IDE autocomplete, small code completions, grammar fixes).
  • Yes — substantive assistance (AI-generated non‑trivial portions of code, tests, or documentation).
  • Yes — other (please describe):

Documentation (please check one with an [x])

  • I have updated the documentation, or
  • No documentation update is required

How I've tested my work (please select one)

I have verified these changes via:

  • Code inspection only, or
  • Newly added/modified unit tests, or
  • No unit tests, but ran on a real repository, or
  • Both unit tests + ran on a real repository

The public repository:

Member

@viceice viceice left a comment


you can do the buffer change in a follow-up PR if it feels better to you

Comment thread lib/modules/datasource/rpm/index.ts Outdated

```diff
     return await this.getReleasesByPackageName(primaryGzipUrl, packageName);
   } catch (err) {
-    this.handleGenericErrors(err);
+    this.handleGenericErrors(err as Error);
```
Member

is that change really needed?

Contributor Author

A left-over from me fighting the linter in another place. Will drop it.


Comment thread lib/modules/datasource/rpm/index.ts Outdated

```diff
 // Fetches the primary.xml.gz URL from the repomd.xml file.
-private async _getPrimaryGzipUrl(
+private getPrimaryRepodataUrl(
```
Member

I think this can be moved to a separate file, as it no longer needs `this` context?

Contributor Author

Makes sense. I can extract the repomd parsing helper into a small helper module so index.ts is just fetch/cache/orchestration.
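For illustration, such a helper could look roughly like this (a sketch with hypothetical names and a naive regex purely for demonstration; the real code parses repomd.xml properly). repomd.xml lists a `<data type="primary">` entry whose `<location href="..."/>` points at the primary metadata file relative to the repository root:

```typescript
// Hypothetical free function extracted from the datasource class:
// resolve the primary metadata URL from a repomd.xml document.
function getPrimaryRepodataUrl(
  repoUrl: string,
  repomdXml: string,
): string | null {
  // Find the <data type="primary"> block (not primary_db, etc.).
  const dataBlock = /<data\s+type="primary">([\s\S]*?)<\/data>/.exec(repomdXml);
  if (!dataBlock) {
    return null;
  }
  // The location href is relative to the repository root.
  const location = /<location\s+href="([^"]+)"/.exec(dataBlock[1]);
  return location ? `${repoUrl.replace(/\/$/, '')}/${location[1]}` : null;
}
```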


```ts
    url: string,
  ): Promise<Buffer> {
    try {
      const response = await http.getBuffer(url);
```
Member

this should be refactored to use a cache file, so we don't hold the huge buffer in memory

Contributor Author

@kpumuk kpumuk Mar 17, 2026

This is a great suggestion. In fact, my CI worker got killed today due to OOM, and I strongly suspect this code. I will try to generalize this to fit both this branch and the primary_db/sqlite one (which already materializes the database on disk, but still unzips it in memory).

Between the two options, `withCache()` with a pre-defined TTL and a file-based freshness comparison (similar to what the Debian datasource does), I went with `If-Modified-Since` and timestamps. Let me know if `withCache()` is preferred.
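The timestamp approach can be sketched like this (assumed flow with hypothetical function names, not the actual Renovate helpers): the cached file's mtime becomes the `If-Modified-Since` request header, and an HTTP 304 response means the on-disk copy can be reused without re-downloading.

```typescript
import { stat } from 'node:fs/promises';

// Build the conditional-request header from a cached file's mtime.
function conditionalHeaders(mtime: Date): Record<string, string> {
  return { 'If-Modified-Since': mtime.toUTCString() };
}

// Interpret the response: 304 Not Modified means the cache is still current.
function cacheIsFresh(status: number): boolean {
  return status === 304;
}

// Hypothetical freshness check: true when the file must be re-downloaded.
async function shouldRedownload(
  url: string,
  cachePath: string,
): Promise<boolean> {
  let mtime: Date;
  try {
    mtime = (await stat(cachePath)).mtime;
  } catch {
    return true; // no cached file on disk yet
  }
  const res = await fetch(url, {
    headers: conditionalHeaders(mtime),
  });
  return !cacheIsFresh(res.status);
}
```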

The benchmark is promising. I did 3 runs over the 50 Amazon Linux 2023 dependencies:

| Branch | Run 1 Wall Time | Run 2 Wall Time | Run 3 Wall Time | Avg Wall Time | Run 1 Peak RSS | Run 2 Peak RSS | Run 3 Peak RSS | Avg Peak RSS |
|---|---|---|---|---|---|---|---|---|
| upstream/main | 418.05s | 404.64s | 426.97s | 416.55s | 1425.1 MiB | 1432.1 MiB | 1877.5 MiB | 1578.2 MiB |
| rpm-xml-provider | 419.05s | 420.19s | 411.54s | 416.93s | 535.7 MiB | 534.9 MiB | 531.0 MiB | 533.8 MiB |
| Delta (rpm-xml-provider vs upstream/main) | +1.00s | +15.55s | -15.43s | +0.37s (+0.09%) | -889.4 MiB | -897.2 MiB | -1346.5 MiB | -1044.4 MiB (-66.2%) |

Contributor Author

23e860f

I have added an `extension` parameter to all helpers, so I can re-use it later for the primary_db change.

```ts
        setImmediate(() => saxParser.removeAllListeners());
        resolve();
      });
      Readable.from(decompressedBuffer).pipe(saxParser);
```
Member

we should use a promise-based pipeline here and await this and the outer promise. Maybe the outer isn't needed then.

Contributor Author

A direct pipeline(Readable.from(buffer), saxParser) would look simpler, but it does not work correctly here. sax.createStream() is duplex, and with no consumer on its readable side the pipeline hangs.

So if we switch this to pipeline(...), it needs a small writable adapter in front of the SAX parser. That keeps the pipeline lifecycle handling, while still letting SAX drive parsing and surface parser errors correctly.

```diff
diff --git a/lib/modules/datasource/rpm/providers/xml.ts b/lib/modules/datasource/rpm/providers/xml.ts
index 06bd1bb7e..16d6a51c6 100644
--- a/lib/modules/datasource/rpm/providers/xml.ts
+++ b/lib/modules/datasource/rpm/providers/xml.ts
@@ -1,5 +1,6 @@
-import { Readable } from 'node:stream';
+import { Readable, Writable } from 'node:stream';
 import sax from 'sax';
+import { pipeline } from 'stream/promises';
 import { logger } from '../../../../logger/index.ts';
 import type { Http } from '../../../../util/http/index.ts';
 import type { ReleaseResult } from '../../types.ts';
@@ -67,25 +68,34 @@ export class RpmXmlMetadataProvider {
       }
     });
 
-    await new Promise<void>((resolve, reject) => {
-      let settled = false;
-      saxParser.on('error', (err: Error) => {
-        if (settled) {
-          return;
-        }
-        settled = true;
-        logger.debug(`SAX parsing error in ${primaryGzipUrl}: ${err.message}`);
-        setImmediate(() => saxParser.removeAllListeners());
-        reject(err);
-      });
-      saxParser.on('end', () => {
-        settled = true;
-        setImmediate(() => saxParser.removeAllListeners());
-        resolve();
-      });
-      Readable.from(decompressedBuffer).pipe(saxParser);
+    let parserError: Error | undefined;
+    saxParser.on('error', (err: Error) => {
+      if (parserError) {
+        return;
+      }
+
+      parserError = err;
+      logger.debug(
+        `SAX parsing error in ${primaryGzipUrl}: ${
+          err instanceof Error ? err.message : err
+        }`,
+      );
     });
 
+    await pipeline(
+      Readable.from(decompressedBuffer),
+      new Writable({
+        write(chunk, _encoding, callback) {
+          saxParser.write(chunk);
+          callback(parserError);
+        },
+        final(callback) {
+          saxParser.end();
+          callback(parserError);
+        },
+      }),
+    );
+
     const result = buildReleaseResult(releases);
     if (!result) {
       logger.trace(
```

@kpumuk force-pushed the rpm-xml-provider branch from 169e3b9 to 7ede374 on March 17, 2026, 19:41
@kpumuk kpumuk requested a review from viceice March 20, 2026 21:31
Contributor Author

kpumuk commented Mar 26, 2026

@viceice would you mind re-checking this proposal? It did grow a little from the initial simple refactoring, but I think even on its own, without primary_db, this change provides a significant memory optimization.

@kpumuk force-pushed the rpm-xml-provider branch from 7ede374 to 18ea133 on April 16, 2026, 17:54
Contributor Author

kpumuk commented Apr 16, 2026

Added one more commit after finding an interesting edge case.

Before this change, a metadata refresh would extract directly into the shared cached XML file. If the downloaded archive was bad or truncated, we could partially overwrite a previously good cache entry and still return that path, leaving later lookups to read corrupted metadata. There was also a race where one lookup could be parsing the file while another was rewriting it in place.

The fix is to extract into a temporary file first and only replace the shared cache file after extraction succeeds. That means readers see either the old complete file or the new complete file, and a failed refresh preserves the previous known-good cache.

This is not unique to RPM. The Debian datasource currently uses the same in-place extraction pattern, so it is susceptible to the same class of issue and should get the same temp-file + atomic-replace treatment in a follow-up.
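The temp-file + atomic-replace pattern described above can be sketched as follows (illustrative names, not the actual commit): extract next to the cache file and rename over it only on success, so a failed extraction never disturbs the previous known-good file.

```typescript
import { rename, rm } from 'node:fs/promises';

// Extract into a sibling temp file, then atomically replace the shared
// cache file. Readers see either the old complete file or the new one;
// a failed extraction leaves the previous known-good cache untouched.
async function replaceCacheAtomically(
  cachePath: string,
  extract: (destination: string) => Promise<void>,
): Promise<void> {
  // Temp file lives in the same directory so rename() stays atomic
  // (rename across filesystems is not atomic and may fail).
  const tmpPath = `${cachePath}.tmp-${process.pid}`;
  try {
    await extract(tmpPath); // may throw on a bad or truncated archive
    await rename(tmpPath, cachePath); // atomic replace on POSIX
  } catch (err) {
    await rm(tmpPath, { force: true }); // discard the partial extraction
    throw err;
  }
}
```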
