search-zip should use standardised file detection instead of extension #3342

@clarfonthey

I realise that this is effectively #2085, but I don't really think this is a niche use case, considering how easily ripgrep could implement this.

Essentially, pretty much every standardised compression format has a magic number in its header, and it feels like it would be better to rely on that than on the file extension.

While I also think it would be beneficial to just use pure-Rust implementations of the decompression instead of external programs (which would mean all this header detection is essentially already included with the code), I get not doing so, and think that this is a reasonable compromise.

Note that this is different than, say, expecting language-format detection to use a more complicated algorithm, and is actually pretty trivial to do if you're already going to be opening the files to determine whether they're text or not.

Just to demonstrate how simple these are, I decided to list the headers for all the formats ripgrep supports:

  • gzip: 1F 8B
  • bzip2: 42 5A 68
  • xz: FD 37 7A 58 5A 00
  • lz4: 04 22 4D 18
  • lzma: __ __ __ __ __ FF FF FF FF FF FF FF FF (see footnotes)
  • br: (not available)
  • zstd: 28 B5 2F FD
  • uncompress: 1F 9D

Footnotes:

  • Brotli has no dedicated magic number, so, you do have to rely on the .br extension.
  • LZMA is in a similar boat, although streamed data carries an 8-byte run of FF (after the first five header bytes) that libmagic treats as a fairly reliable indicator.
  • All these numbers were taken from either the List of file signatures Wikipedia page (easy to read) or directly from the source code of libmagic (harder to read).
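
To make the idea concrete, here is a minimal sketch of what such a check could look like. The names `Compression`, `SIGNATURES` and `detect_compression` are hypothetical and are not taken from ripgrep's code. The zstd signature is written in on-disk byte order (28 B5 2F FD), and the LZMA check is only the heuristic described in the footnote, so it can produce false positives.

```rust
/// Formats from the list above. Brotli is absent because it has no magic number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Lz4,
    Lzma,
    Zstd,
    Compress,
}

/// Fixed signatures that must appear at the very start of the file.
const SIGNATURES: &[(&[u8], Compression)] = &[
    (&[0x1F, 0x8B], Compression::Gzip),
    (&[0x42, 0x5A, 0x68], Compression::Bzip2),
    (&[0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00], Compression::Xz),
    (&[0x04, 0x22, 0x4D, 0x18], Compression::Lz4),
    (&[0x28, 0xB5, 0x2F, 0xFD], Compression::Zstd),
    (&[0x1F, 0x9D], Compression::Compress),
];

/// Guesses the compression format from the first bytes of a file.
/// Returns `None` when nothing matches, in which case a caller could
/// still fall back to the extension (needed for Brotli anyway).
pub fn detect_compression(header: &[u8]) -> Option<Compression> {
    for &(magic, format) in SIGNATURES {
        if header.starts_with(magic) {
            return Some(format);
        }
    }
    // Heuristic for streamed LZMA: bytes 5..13 are all 0xFF.
    if header.len() >= 13 && header[5..13].iter().all(|&b| b == 0xFF) {
        return Some(Compression::Lzma);
    }
    None
}
```

Thirteen bytes of header are enough for every check in this sketch, and a header shorter than a given signature simply fails to match it.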

As mentioned, since these files are being opened for reading anyway, it would be helpful to just automatically detect these files instead of relying on the extension, which isn't always present. As an example, some common formats like mtree are usually compressed by another format by default (in this case, gzip) even though they're just text, and rather than suggesting that ripgrep account for every one of these particular examples, just reading two bytes of the file itself seems like a much more robust solution.
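
As a rough illustration of "reading the first bytes while opening the file anyway", the sketch below peeks at a prefix of any reader and then hands back a reader that replays those bytes, so nothing is lost for whatever consumes the file afterwards. `peek_prefix` and `looks_like_gzip` are hypothetical helpers written for this example; I'm not claiming this is how ripgrep structures its file handling.

```rust
use std::io::{self, Cursor, Read};

/// Reads up to `max` bytes from `reader` and returns them together with
/// a new reader that yields those same bytes again, followed by the rest.
pub fn peek_prefix<R: Read>(mut reader: R, max: usize) -> io::Result<(Vec<u8>, impl Read)> {
    let mut prefix = Vec::with_capacity(max);
    // `take` stops after `max` bytes; shorter inputs just give a shorter prefix.
    reader.by_ref().take(max as u64).read_to_end(&mut prefix)?;
    // Chain a copy of the prefix in front of the untouched remainder.
    let replay = Cursor::new(prefix.clone()).chain(reader);
    Ok((prefix, replay))
}

/// The two-byte gzip check mentioned above.
pub fn looks_like_gzip(prefix: &[u8]) -> bool {
    prefix.starts_with(&[0x1F, 0x8B])
}
```

With something like this, an extensionless gzip-compressed mtree file would be recognised from its first two bytes, and the replaying reader could then be passed on unchanged to either the decompression path or the normal search path.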
