Followup: lrzip

Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor.  In fact, Con has already done a simple 64-bit enhance of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives 100k worse than just using lrzip: lrzip already uses lzma).

Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range that the backend compressor (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip).  The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).

The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to.  Huge here == size of file (rzip does not stream, for this reason).  And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small).  The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip).   If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.

I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes.  Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(

6 replies on “Followup: lrzip”

  1. What about 7z which also uses LZMA. If I remember correctly it offer pretty good compressions ratios. The 7-Zip compressor/decompressor is covered by the GPL and I’m pretty sure it is completely unencumbered by patents.

  2. Another suggestion for you to experiment with.

    Tar the files sorted by file extension instead of file name (if you don’t already).

    When we did this for the Strawberry Perl MSI installer, we saw a 10% reduction in file size, on the assumption that files of a similar file extension probably have similar content, and thus grouping them will feed the compression algorithm better.

  3. Interesting! OK, so I tested that here:

    $ find linux-2.6 -name .git -prune -o -print | sed 's,\([./][^./]*$\),\1 \1,' | sort -k2 | cut -d\ -f1 > /tmp/sorted-list
    $ tar cf /tmp/linux.2.6.sorted.tar -T /tmp/sorted-list --no-recursion
    $ ls -l /tmp/linux.2.6.tar /tmp/linux.2.6.sorted.tar
    -rw-r--r-- 1 rusty rusty 395048960 2010-02-17 10:06 /tmp/linux.2.6.sorted.tar
    -rw-r--r-- 1 rusty rusty 395048960 2010-02-16 09:03 /tmp/linux.2.6.tar

    OK, now the results for each compressor (-9 –to-stdout or -M for lrzip):

    gzip /tmp/linux.2.6.tar: 86388655
    gzip /tmp/linux.2.6.sorted.tar: 84904755
    bzip2 /tmp/linux.2.6.tar: 67350827
    bzip2 /tmp/linux.2.6.sorted.tar: 66784160
    xz /tmp/linux.2.6.tar: 54614828
    xz /tmp/linux.2.6.sorted.tar: 55100332
    lrzip /tmp/linux.2.6.tar: 56500239
    lrzip /tmp/linux.2.6.sorted.tar: 56104936

    The xz result is flat-out weird; AFAICT any attempt to help xz with compression makes it worse. I wonder if it was actually optimized using the linux kernel source?

    (PS. Added comment preview, since it was annoying me).

  4. Hi Rusty.

    lrzip only becomes useful once the simple long distance dictionary rzip stage is significantly larger than the compression window of the backend (lzma). I doubt you’ll find the rzip first stage helps the kernel source at its current size when lzma’s compression windows are the size they already are. You’ll only start getting a benefit with a much larger codebase… which will be in not so long for linux kernel :P . If you see the benchmarks I posted, lrzip only has an advantage with massive files (and massive ram) when there is long distance redundancy. So if you start including multiple trees instead of a single kernel version, you’ll start seeing a benefit from the rzip stage. By a (not so strange) coincidence I also used kernel trees to test it.

    As for bugs in rzip, I found it spat the dummy on greater than 32bit signed in multiple ways and suspect that’s why it had the artificial 900MB limit. Dealing with 32bit, mmap, write() limitations and so on has been a bit of a pain when trying to scale to 64bit sizes I have to admit. That’s been the bulk of the work that went into lrzip. In the process it actually made the compression a little less efficient and slower than rzip due to the 64bit offsets in the archives. xz’s performance on massive files was not so great, being both much slower than 7z (due to not being multithreaded) and not even compressing as well as it. I often wonder if the ground swell for xz on linux is well founded, but then there’s always more to software selection than merit on purely technical grounds. By that I don’t just mean politically…

    1. Using back-to-back kernel sources isn’t a fair test though. Perhaps try the 100M bz2 that is the sources?

      The issue of 32-bit rzip and 900MB was simply the limitations of machines at the time Tridge was doing his thesis. 64 bit is the obvious thing to do these days, as well as much larger window. If I get cycles (ha!) I’d like to revisit it entirely: the table of backrefs and literal sizes should be compressed separately from the data itself, and rzip should never emit matches which are within the range of the backend. Also, LZMA2 should be used (that’s what xz uses I’m told).

      For the current kernel sources, xz wins, making it a good choice for the mirrors. But an rzip variant should still be able to do at least as well, and better as we get larger.

  5. I have just installed and tried lrzip and I am impressed. It gives much better results than all other compressors for large files that are difficult to compress as they already use some form of compression. For example:

    Uncompressed tar file with AVI files: 3206 GB
    gzip -9: 3179 GB
    bzip2 -9: 3178 GB
    xz: 3179 GB
    xz -9: 3175 GB
    7z -mx=9: 3180 GB
    lrzip -M: 2829 GB
    lrzip -l: 2840 GB
    lrzip -n: 2852 GB

    The times were over 40 min. for bzip2, xz and 7z, 22 min. for lrzip -M, less than 5 min. for lrzip -l and less than 2 min. for lrzip -n.

Comments are closed.