Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor. In fact, Con has already done a simple 64-bit enhancement of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode followed by xz gives a result 100k larger than just using lrzip: lrzip already uses lzma).
Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range the backend compressor will cover anyway (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip). The backend compressor will do a better job, as shown by what happens with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger.
The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to. Huge here == size of file (rzip does not stream, for this reason). And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small). The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip). If we carefully limit the number of comparisons against previous parts of the file, we may be able to reduce them to the point where we don’t confuse the kernel’s readahead algorithms, and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.
I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes. Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(