`search-zip` should use standardised file detection instead of extension

I realise that this is effectively #2085, but I don't really think this is a niche use case, considering how easily ripgrep could implement this.

Essentially, pretty much every standardised compression format has a magic number in the header that you can rely upon instead of the extension, and it feels like it would be better to use this than the extension.

While I also think it would be beneficial to just use pure-Rust implementations of the decompression instead of external programs (which would mean all this header detection is essentially already included with the code), I get not doing so, and think that this is a reasonable compromise.

Note that this is different than, say, expecting language-format detection to use a more complicated algorithm, and is actually pretty trivial to do if you're *already* going to be opening the files to determine whether they're text or not.

Just to demonstrate how simple these are, I decided to find the headers for all the formats ripgrep supports:

* `gzip`: `1F 8B`
* `bzip2`: `42 5A 68`
* `xz`: `FD 37 7A 58 5A 00`
* `lz4`: `04 22 4D 18`
* `lzma`: `__ __ __ __ __ FF FF FF FF FF FF FF FF` (see footnotes)
* `br`: (not available)
* `zstd`: `28 B5 2F FD`
* `uncompress`: `1F 9D`
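To make the idea concrete, here is a minimal sketch of what a signature check over the list above could look like. The `Compression` enum, the `SIGNATURES` table, and `detect_compression` are hypothetical names for illustration, not ripgrep's actual types or API. The zstd entry uses the byte order from the Zstandard frame format (magic `0xFD2FB528`, stored little-endian), and the LZMA entry is only the weak "eight `FF` bytes at offset 5" heuristic, so it is checked last.

```rust
/// Formats from the list above. Hypothetical enum, not ripgrep's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Lz4,
    LzmaStreamed,
    Zstd,
    Compress,
}

/// (offset, magic bytes, format). Longer and stronger signatures come first;
/// the LZMA heuristic is last because it is the least specific.
const SIGNATURES: &[(usize, &[u8], Compression)] = &[
    (0, &[0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00], Compression::Xz),
    (0, &[0x04, 0x22, 0x4D, 0x18], Compression::Lz4),
    (0, &[0x28, 0xB5, 0x2F, 0xFD], Compression::Zstd),
    (0, &[0x42, 0x5A, 0x68], Compression::Bzip2),
    (0, &[0x1F, 0x8B], Compression::Gzip),
    (0, &[0x1F, 0x9D], Compression::Compress),
    (5, &[0xFF; 8], Compression::LzmaStreamed),
];

/// Returns the first format whose signature matches the given header bytes,
/// or `None` if nothing matches (including when the header is too short).
pub fn detect_compression(header: &[u8]) -> Option<Compression> {
    SIGNATURES.iter().find_map(|&(offset, magic, format)| {
        header
            .get(offset..offset + magic.len())
            .filter(|window| *window == magic)
            .map(|_| format)
    })
}
```

Passing the first 13 bytes of a file is enough to cover every entry in this table; brotli would still need the extension-based fallback.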

Footnotes:

* Brotli has no dedicated magic number, so you do have to rely on the `.br` extension.
* LZMA is in a similar boat, although `libmagic` treats a run of eight `FF` bytes at offset 5 as a fairly reliable indicator of streamed data.
* All these numbers were taken from either the [List of file signatures Wikipedia page](https://en.wikipedia.org/wiki/List_of_file_signatures) (easy to read) or directly from the [source code of libmagic](https://github.com/file/file/blob/master/magic/Magdir/compress) (harder to read).

As mentioned, since these files are being opened for reading *anyway*, it would be helpful to just automatically detect these files instead of relying on the extension, which isn't always present. As an example, some common formats like `mtree` are usually compressed by another format by default (in this case, gzip) even though they're just text. Rather than suggesting that ripgrep account for every one of these particular examples, just reading the first few bytes of the file itself seems like a much more robust solution.
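For the `mtree` case specifically, a rough sketch of the "peek at the header" step might look like the following. It uses only the standard library; `SNIFF_LEN`, `read_prefix`, and `file_looks_gzipped` are made-up names for illustration, and how this would hook into ripgrep's existing decompression path is left open.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Assumed upper bound on how many bytes any signature check needs.
const SNIFF_LEN: u64 = 13;

/// Reads at most `SNIFF_LEN` bytes from the start of `reader`.
/// Shorter inputs simply yield a shorter prefix.
pub fn read_prefix<R: Read>(reader: R) -> io::Result<Vec<u8>> {
    let mut prefix = Vec::with_capacity(SNIFF_LEN as usize);
    reader.take(SNIFF_LEN).read_to_end(&mut prefix)?;
    Ok(prefix)
}

/// Checks for the gzip magic number (`1F 8B`) regardless of the file name.
pub fn file_looks_gzipped(path: &Path) -> io::Result<bool> {
    let prefix = read_prefix(File::open(path)?)?;
    Ok(prefix.starts_with(&[0x1F, 0x8B]))
}
```

With something like this, a gzip-compressed `mtree` file with no `.gz` suffix would still be recognised, while an ordinary text file would fall through to the normal search path.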

`search-zip` should use standardised file detection instead of extension #3342

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

search-zip should use standardised file detection instead of extension #3342

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

`search-zip` should use standardised file detection instead of extension #3342