refactor(datasource/rpm): extract xml metadata provider #41910

kpumuk wants to merge 6 commits into renovatebot:main from
Conversation
viceice left a comment
you can do the buffer change in a follow-up PR if it feels better to you
```diff
     return await this.getReleasesByPackageName(primaryGzipUrl, packageName);
   } catch (err) {
-    this.handleGenericErrors(err);
+    this.handleGenericErrors(err as Error);
```
A left-over from fighting the linter in another place. Will drop it.
```diff
   // Fetches the primary.xml.gz URL from the repomd.xml file.
-  private async _getPrimaryGzipUrl(
+  private getPrimaryRepodataUrl(
```
I think this can be moved to a separate file, as it no longer needs `this` context?
Makes sense. I can extract the repomd parsing helper into a small helper module, so `index.ts` is just fetch/cache/orchestration.
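To illustrate the idea, a standalone repomd helper could look roughly like this. This is a hedged sketch, not the actual Renovate code: the function name `findRepodataLocation` and its regex-based parsing are illustrative only (the real helper would use the project's XML parsing utilities).

```typescript
// Hypothetical standalone helper: given the text of repomd.xml, find the
// location href of a metadata file of the requested type. repomd.xml lists
// each metadata file as:
//   <data type="primary"><location href="repodata/...-primary.xml.gz"/></data>
function findRepodataLocation(
  repomdXml: string,
  dataType: string,
): string | null {
  // Grab the <data type="..."> ... </data> block for the requested type.
  const dataBlock = new RegExp(
    `<data\\s+type="${dataType}"[\\s\\S]*?</data>`,
  ).exec(repomdXml);
  if (!dataBlock) {
    return null;
  }
  // Extract the href attribute of the nested <location/> element.
  const location = /<location\s+href="([^"]+)"/.exec(dataBlock[0]);
  return location ? location[1] : null;
}

const sample = `
<repomd>
  <data type="primary">
    <location href="repodata/abc-primary.xml.gz"/>
  </data>
  <data type="primary_db">
    <location href="repodata/def-primary.sqlite.bz2"/>
  </data>
</repomd>`;
console.log(findRepodataLocation(sample, 'primary'));
// prints repodata/abc-primary.xml.gz
```

A helper like this has no dependency on the datasource class, which is what makes the extraction into its own module straightforward.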
```ts
    url: string,
  ): Promise<Buffer> {
    try {
      const response = await http.getBuffer(url);
```
this should be refactored to use a cache file, so we don't hold the huge buffer in memory
This is a great suggestion. In fact, my CI worker got killed today due to OOM, and I strongly suspect this code. I will try to generalize this to fit both this branch and the `primary_db`/SQLite provider (which already materializes the database on disk, but still unzips it in memory).
Between the two options, `withCache()` with a pre-defined TTL and a file-based freshness comparison (similar to what the Debian datasource does), I went with `If-Modified-Since` and timestamps. Let me know if `withCache()` is preferred.
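Setting Renovate's real cache utilities aside, the timestamp comparison can be sketched as a small pure check. This is a hedged sketch only; the name `isCacheFresh` and its signature are illustrative, not the actual API:

```typescript
// Hedged sketch of file-based freshness: reuse the on-disk cache only when
// its mtime is not older than the server's Last-Modified header; otherwise
// re-download. A conditional GET with If-Modified-Since achieves the same
// comparison server-side (304 Not Modified means the cache is still good).
function isCacheFresh(
  cachedMtimeMs: number | undefined,
  lastModifiedHeader: string | undefined,
): boolean {
  if (cachedMtimeMs === undefined || !lastModifiedHeader) {
    return false; // nothing cached, or nothing to compare against
  }
  const remoteMs = Date.parse(lastModifiedHeader);
  if (Number.isNaN(remoteMs)) {
    return false; // unparsable header: play it safe and re-download
  }
  return cachedMtimeMs >= remoteMs;
}

const lastModified = 'Wed, 21 Oct 2015 07:28:00 GMT';
console.log(isCacheFresh(Date.parse(lastModified), lastModified)); // prints true
console.log(isCacheFresh(undefined, lastModified)); // prints false
```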
The benchmark is promising. I did 3 runs over the 50 Amazon Linux 2023 dependencies:
| Branch | Run 1 Wall Time | Run 2 Wall Time | Run 3 Wall Time | Avg Wall Time | Run 1 Peak RSS | Run 2 Peak RSS | Run 3 Peak RSS | Avg Peak RSS |
|---|---|---|---|---|---|---|---|---|
| upstream/main | 418.05s | 404.64s | 426.97s | 416.55s | 1425.1 MiB | 1432.1 MiB | 1877.5 MiB | 1578.2 MiB |
| rpm-xml-provider | 419.05s | 420.19s | 411.54s | 416.93s | 535.7 MiB | 534.9 MiB | 531.0 MiB | 533.8 MiB |
| Delta (rpm-xml-provider vs upstream/main) | +1.00s | +15.55s | -15.43s | +0.37s (+0.09%) | -889.4 MiB | -897.2 MiB | -1346.5 MiB | -1044.4 MiB (-66.2%) |
I have added an `extension` parameter to all helpers, so I can re-use it later for the `primary_db` change.
```ts
        setImmediate(() => saxParser.removeAllListeners());
        resolve();
      });
      Readable.from(decompressedBuffer).pipe(saxParser);
```
We should use the promisified `pipeline` here and await it instead of the outer promise; maybe the outer promise isn't needed then.
A direct `pipeline(Readable.from(buffer), saxParser)` would look simpler, but it does not work correctly here: `sax.createStream()` is duplex, and with no consumer on its readable side the pipeline hangs.

So if we switch to `pipeline(...)`, it needs a small writable adapter in front of the SAX parser. That keeps the pipeline lifecycle handling while still letting SAX drive parsing and surface parser errors correctly.
```diff
diff --git a/lib/modules/datasource/rpm/providers/xml.ts b/lib/modules/datasource/rpm/providers/xml.ts
index 06bd1bb7e..16d6a51c6 100644
--- a/lib/modules/datasource/rpm/providers/xml.ts
+++ b/lib/modules/datasource/rpm/providers/xml.ts
@@ -1,5 +1,6 @@
-import { Readable } from 'node:stream';
+import { Readable, Writable } from 'node:stream';
 import sax from 'sax';
+import { pipeline } from 'stream/promises';
 import { logger } from '../../../../logger/index.ts';
 import type { Http } from '../../../../util/http/index.ts';
 import type { ReleaseResult } from '../../types.ts';
@@ -67,25 +68,34 @@ export class RpmXmlMetadataProvider {
       }
     });
-    await new Promise<void>((resolve, reject) => {
-      let settled = false;
-      saxParser.on('error', (err: Error) => {
-        if (settled) {
-          return;
-        }
-        settled = true;
-        logger.debug(`SAX parsing error in ${primaryGzipUrl}: ${err.message}`);
-        setImmediate(() => saxParser.removeAllListeners());
-        reject(err);
-      });
-      saxParser.on('end', () => {
-        settled = true;
-        setImmediate(() => saxParser.removeAllListeners());
-        resolve();
-      });
-      Readable.from(decompressedBuffer).pipe(saxParser);
+    let parserError: Error | undefined;
+    saxParser.on('error', (err: Error) => {
+      if (parserError) {
+        return;
+      }
+
+      parserError = err;
+      logger.debug(
+        `SAX parsing error in ${primaryGzipUrl}: ${
+          err instanceof Error ? err.message : err
+        }`,
+      );
     });
+    await pipeline(
+      Readable.from(decompressedBuffer),
+      new Writable({
+        write(chunk, _encoding, callback) {
+          saxParser.write(chunk);
+          callback(parserError);
+        },
+        final(callback) {
+          saxParser.end();
+          callback(parserError);
+        },
+      }),
+    );
+
     const result = buildReleaseResult(releases);
     if (!result) {
       logger.trace(
```
@viceice would you mind re-checking this proposal? It did grow a little bit from the initial simple refactoring, but I think even on its own without
Added one more commit after finding an interesting edge case.

Before this change, a metadata refresh would extract directly into the shared cached XML file. If the downloaded archive was bad or truncated, we could partially overwrite a previously good cache entry and still return that path, leaving later lookups to read corrupted metadata. There was also a race where one lookup could be parsing the file while another was rewriting it in place.

The fix is to extract into a temporary file first and only replace the shared cache file after extraction succeeds. That means readers see either the old complete file or the new complete file, and a failed refresh preserves the previous known-good cache.

This is not unique to RPM. The Debian datasource currently uses the same in-place extraction pattern, so it is susceptible to the same class of issue and should get the same temp-file + atomic-replace treatment in a follow-up.
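The temp-file + atomic-replace pattern can be sketched as follows. This is a minimal illustration, not the PR's actual code; `atomicWrite` and the temp-path naming scheme are assumptions:

```typescript
import { mkdtempSync, writeFileSync, renameSync, readFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hedged sketch of the temp-file + atomic-replace idea: write the new data
// to a sibling temp path first, then promote it over the shared cache file
// with rename(), which is atomic on the same filesystem. If writing throws,
// the previous known-good cache file is left untouched.
function atomicWrite(targetPath: string, contents: Buffer | string): void {
  const tmpPath = `${targetPath}.tmp-${process.pid}`;
  writeFileSync(tmpPath, contents); // may fail: target stays intact
  renameSync(tmpPath, targetPath); // atomic promotion of the complete file
}

const dir = mkdtempSync(join(tmpdir(), 'rpm-cache-'));
const cachePath = join(dir, 'primary.xml');
atomicWrite(cachePath, '<metadata>old</metadata>');
atomicWrite(cachePath, '<metadata>new</metadata>');
console.log(readFileSync(cachePath, 'utf8')); // prints <metadata>new</metadata>
```

Because `rename()` replaces the path in one step, concurrent readers see either the old complete file or the new complete file, never a partially written one.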
Changes
This pull request was split from #41866, and only contains the basic refactoring of the RPM data source to extract the `primary` (XML) package format provider.

Previously, the RPM datasource mixed repository metadata discovery and XML package parsing inside a single class. That made the code harder to extend for additional metadata backends and made the later SQLite work more invasive than it needed to be.
Note
This refactor intentionally keeps one small behavior improvement that was already folded into the extraction: malformed RPM XML entries with a `<version>` element missing the `ver` attribute are now ignored instead of producing a release with version `undefined`. That is a behavior change for invalid metadata, but it is a safer outcome and aligns better with the intended RPM parsing behavior.
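The skip-on-missing-`ver` behavior amounts to a small guard during parsing. A hedged sketch, with illustrative types and names rather than the datasource's real ones:

```typescript
// Hypothetical shape of the attributes parsed from a <version/> element.
interface VersionAttrs {
  ver?: string;
  rel?: string;
}

// Sketch of the behavior change: entries whose <version> element lacks a
// `ver` attribute are dropped rather than emitted with version `undefined`.
function toReleaseVersion(attrs: VersionAttrs): string | null {
  if (!attrs.ver) {
    return null; // malformed entry: ignore it
  }
  return attrs.rel ? `${attrs.ver}-${attrs.rel}` : attrs.ver;
}

const entries: VersionAttrs[] = [
  { ver: '1.2.3', rel: '1.amzn2023' },
  { rel: '2.amzn2023' }, // missing ver: dropped
  { ver: '2.0.0' },
];
const releases = entries
  .map(toReleaseVersion)
  .filter((v): v is string => v !== null);
console.log(releases); // prints [ '1.2.3-1.amzn2023', '2.0.0' ]
```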
In the #41910 (comment) discussion an additional improvement was proposed: unpack the XML to a file on disk, which led to significant memory savings. In the benchmark over 50 Amazon Linux 2023 dependencies, `rpm-xml-provider` vs `upstream/main` came out at +0.09% average wall time and -66.2% average peak RSS.