refactor(datasource/rpm): extract xml metadata provider #41910

kpumuk wants to merge 6 commits into renovatebot:main from
Conversation
viceice left a comment
you can do the buffer change in a follow-up PR if it feels better to you
```diff
     return await this.getReleasesByPackageName(primaryGzipUrl, packageName);
   } catch (err) {
-    this.handleGenericErrors(err);
+    this.handleGenericErrors(err as Error);
```
A left-over from fighting the linter in another place. Will drop it.
```diff
   // Fetches the primary.xml.gz URL from the repomd.xml file.
-  private async _getPrimaryGzipUrl(
+  private getPrimaryRepodataUrl(
```
I think this can be moved to a separate file, as it no longer needs `this` context?
Makes sense. I can extract the repomd parsing helper into a small helper module, so `index.ts` is just fetch/cache/orchestration.
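To illustrate the idea, a standalone repomd helper could look roughly like this. This is a hedged sketch, not the actual Renovate code: the function name `findRepodataLocation` and its regex-based parsing are illustrative only (the real helper would use the project's XML parsing utilities).

```typescript
// Hypothetical standalone helper: given the text of repomd.xml, find the
// location href of a metadata file of the requested type. repomd.xml lists
// each metadata file as:
//   <data type="primary"><location href="repodata/...-primary.xml.gz"/></data>
function findRepodataLocation(
  repomdXml: string,
  dataType: string,
): string | null {
  // Grab the <data type="..."> ... </data> block for the requested type.
  const dataBlock = new RegExp(
    `<data\\s+type="${dataType}"[\\s\\S]*?</data>`,
  ).exec(repomdXml);
  if (!dataBlock) {
    return null;
  }
  // Extract the href attribute of the nested <location/> element.
  const location = /<location\s+href="([^"]+)"/.exec(dataBlock[0]);
  return location ? location[1] : null;
}

const sample = `
<repomd>
  <data type="primary">
    <location href="repodata/abc-primary.xml.gz"/>
  </data>
  <data type="primary_db">
    <location href="repodata/def-primary.sqlite.bz2"/>
  </data>
</repomd>`;
console.log(findRepodataLocation(sample, 'primary'));
// prints repodata/abc-primary.xml.gz
```

A helper like this has no dependency on the datasource class, which is what makes the extraction into its own module straightforward.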
```ts
    url: string,
  ): Promise<Buffer> {
    try {
      const response = await http.getBuffer(url);
```
this should be refactored to use a cache file, so we don't hold the huge buffer in memory
This is a great suggestion. In fact, my CI worker got killed today due to OOM, and I strongly suspect this code. I will try to generalize this to fit both this branch and the `primary_db`/SQLite provider (which already materializes the database on disk, but still unzips it in memory).
Between the two options, `withCache()` with a pre-defined TTL and a file-based freshness comparison (similar to what the Debian datasource does), I went with `If-Modified-Since` and timestamps. Let me know if `withCache()` is preferred.
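Setting Renovate's real cache utilities aside, the timestamp comparison can be sketched as a small pure check. This is a hedged sketch only; the name `isCacheFresh` and its signature are illustrative, not the actual API:

```typescript
// Hedged sketch of file-based freshness: reuse the on-disk cache only when
// its mtime is not older than the server's Last-Modified header; otherwise
// re-download. A conditional GET with If-Modified-Since achieves the same
// comparison server-side (304 Not Modified means the cache is still good).
function isCacheFresh(
  cachedMtimeMs: number | undefined,
  lastModifiedHeader: string | undefined,
): boolean {
  if (cachedMtimeMs === undefined || !lastModifiedHeader) {
    return false; // nothing cached, or nothing to compare against
  }
  const remoteMs = Date.parse(lastModifiedHeader);
  if (Number.isNaN(remoteMs)) {
    return false; // unparsable header: play it safe and re-download
  }
  return cachedMtimeMs >= remoteMs;
}

const lastModified = 'Wed, 21 Oct 2015 07:28:00 GMT';
console.log(isCacheFresh(Date.parse(lastModified), lastModified)); // prints true
console.log(isCacheFresh(undefined, lastModified)); // prints false
```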
The benchmark is promising. I did 3 runs over the 50 Amazon Linux 2023 dependencies:
| Branch | Run 1 Wall Time | Run 2 Wall Time | Run 3 Wall Time | Avg Wall Time | Run 1 Peak RSS | Run 2 Peak RSS | Run 3 Peak RSS | Avg Peak RSS |
|---|---|---|---|---|---|---|---|---|
| upstream/main | 418.05s | 404.64s | 426.97s | 416.55s | 1425.1 MiB | 1432.1 MiB | 1877.5 MiB | 1578.2 MiB |
| rpm-xml-provider | 419.05s | 420.19s | 411.54s | 416.93s | 535.7 MiB | 534.9 MiB | 531.0 MiB | 533.8 MiB |
| Delta (rpm-xml-provider vs upstream/main) | +1.00s | +15.55s | -15.43s | +0.37s (+0.09%) | -889.4 MiB | -897.2 MiB | -1346.5 MiB | -1044.4 MiB (-66.2%) |
I have added an `extension` parameter to all helpers, so I can re-use it later for the `primary_db` change.
```ts
        setImmediate(() => saxParser.removeAllListeners());
        resolve();
      });
      Readable.from(decompressedBuffer).pipe(saxParser);
```
We should use the promisified `pipeline` here and await it instead of the outer promise; maybe the outer promise isn't needed then.
A direct `pipeline(Readable.from(buffer), saxParser)` would look simpler, but it does not work correctly here: `sax.createStream()` is duplex, and with no consumer on its readable side the pipeline hangs.

So if we switch to `pipeline(...)`, it needs a small writable adapter in front of the SAX parser. That keeps the pipeline lifecycle handling while still letting SAX drive parsing and surface parser errors correctly.
```diff
diff --git a/lib/modules/datasource/rpm/providers/xml.ts b/lib/modules/datasource/rpm/providers/xml.ts
index 06bd1bb7e..16d6a51c6 100644
--- a/lib/modules/datasource/rpm/providers/xml.ts
+++ b/lib/modules/datasource/rpm/providers/xml.ts
@@ -1,5 +1,6 @@
-import { Readable } from 'node:stream';
+import { Readable, Writable } from 'node:stream';
 import sax from 'sax';
+import { pipeline } from 'stream/promises';
 import { logger } from '../../../../logger/index.ts';
 import type { Http } from '../../../../util/http/index.ts';
 import type { ReleaseResult } from '../../types.ts';
@@ -67,25 +68,34 @@ export class RpmXmlMetadataProvider {
       }
     });
-    await new Promise<void>((resolve, reject) => {
-      let settled = false;
-      saxParser.on('error', (err: Error) => {
-        if (settled) {
-          return;
-        }
-        settled = true;
-        logger.debug(`SAX parsing error in ${primaryGzipUrl}: ${err.message}`);
-        setImmediate(() => saxParser.removeAllListeners());
-        reject(err);
-      });
-      saxParser.on('end', () => {
-        settled = true;
-        setImmediate(() => saxParser.removeAllListeners());
-        resolve();
-      });
-      Readable.from(decompressedBuffer).pipe(saxParser);
+    let parserError: Error | undefined;
+    saxParser.on('error', (err: Error) => {
+      if (parserError) {
+        return;
+      }
+
+      parserError = err;
+      logger.debug(
+        `SAX parsing error in ${primaryGzipUrl}: ${
+          err instanceof Error ? err.message : err
+        }`,
+      );
     });
+    await pipeline(
+      Readable.from(decompressedBuffer),
+      new Writable({
+        write(chunk, _encoding, callback) {
+          saxParser.write(chunk);
+          callback(parserError);
+        },
+        final(callback) {
+          saxParser.end();
+          callback(parserError);
+        },
+      }),
+    );
+
     const result = buildReleaseResult(releases);
     if (!result) {
       logger.trace(
```
@viceice would you mind re-checking this proposal? It did grow a little bit from the initial simple refactoring, but I think even on its own without
Added one more commit after finding an interesting edge case.

Before this change, a metadata refresh would extract directly into the shared cached XML file. If the downloaded archive was bad or truncated, we could partially overwrite a previously good cache entry and still return that path, leaving later lookups to read corrupted metadata. There was also a race where one lookup could be parsing the file while another was rewriting it in place.

The fix is to extract into a temporary file first and only replace the shared cache file after extraction succeeds. That means readers see either the old complete file or the new complete file, and a failed refresh preserves the previous known-good cache.

This is not unique to RPM. The Debian datasource currently uses the same in-place extraction pattern, so it is susceptible to the same class of issue and should get the same temp-file + atomic-replace treatment in a follow-up.
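The temp-file + atomic-replace pattern can be sketched as follows. This is a minimal illustration, not the PR's actual code; `atomicWrite` and the temp-path naming scheme are assumptions:

```typescript
import { mkdtempSync, writeFileSync, renameSync, readFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hedged sketch of the temp-file + atomic-replace idea: write the new data
// to a sibling temp path first, then promote it over the shared cache file
// with rename(), which is atomic on the same filesystem. If writing throws,
// the previous known-good cache file is left untouched.
function atomicWrite(targetPath: string, contents: Buffer | string): void {
  const tmpPath = `${targetPath}.tmp-${process.pid}`;
  writeFileSync(tmpPath, contents); // may fail: target stays intact
  renameSync(tmpPath, targetPath); // atomic promotion of the complete file
}

const dir = mkdtempSync(join(tmpdir(), 'rpm-cache-'));
const cachePath = join(dir, 'primary.xml');
atomicWrite(cachePath, '<metadata>old</metadata>');
atomicWrite(cachePath, '<metadata>new</metadata>');
console.log(readFileSync(cachePath, 'utf8')); // prints <metadata>new</metadata>
```

Because `rename()` replaces the path in one step, concurrent readers see either the old complete file or the new complete file, never a partially written one.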
Changes
This pull request was split from #41866, and only contains the basic refactoring of the RPM data source to extract the `primary` (XML) package format provider.

Previously, the RPM datasource mixed repository metadata discovery and XML package parsing inside a single class. That made the code harder to extend for additional metadata backends and made the later SQLite work more invasive than it needed to be.
Note
This refactor intentionally keeps one small behavior improvement that was already folded into the extraction: malformed RPM XML entries with a `<version>` element missing the `ver` attribute are now ignored instead of producing a release with version `undefined`. That is a behavior change for invalid metadata, but it is a safer outcome and aligns better with the intended RPM parsing behavior.
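The skip-on-missing-`ver` behavior amounts to a small guard during parsing. A hedged sketch, with illustrative types and names rather than the datasource's real ones:

```typescript
// Hypothetical shape of the attributes parsed from a <version/> element.
interface VersionAttrs {
  ver?: string;
  rel?: string;
}

// Sketch of the behavior change: entries whose <version> element lacks a
// `ver` attribute are dropped rather than emitted with version `undefined`.
function toReleaseVersion(attrs: VersionAttrs): string | null {
  if (!attrs.ver) {
    return null; // malformed entry: ignore it
  }
  return attrs.rel ? `${attrs.ver}-${attrs.rel}` : attrs.ver;
}

const entries: VersionAttrs[] = [
  { ver: '1.2.3', rel: '1.amzn2023' },
  { rel: '2.amzn2023' }, // missing ver: dropped
  { ver: '2.0.0' },
];
const releases = entries
  .map(toReleaseVersion)
  .filter((v): v is string => v !== null);
console.log(releases); // prints [ '1.2.3-1.amzn2023', '2.0.0' ]
```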
In the #41910 (comment) discussion an additional improvement was proposed: unpack the XML to a file on disk, which led to significant memory savings. In the benchmark over 50 Amazon Linux 2023 dependencies, `rpm-xml-provider` vs `upstream/main` came out at +0.09% average wall time and -66.2% average peak RSS.