Rusty Russell's Coding Blog | Stealing From Smart People



Followup: lrzip

Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor.  In fact, Con has already done a simple 64-bit enhance of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives 100k worse than just using lrzip: lrzip already uses lzma).

Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range that the backend compressor (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip).  The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).

The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to.  Huge here == size of file (rzip does not stream, for this reason).  And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small).  The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip).   If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.

I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes.  Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(

RSS Feed

6 Comments for Followup: lrzip

Adam | February 16, 2010 at 2:31 pm

What about 7z which also uses LZMA. If I remember correctly it offer pretty good compressions ratios. The 7-Zip compressor/decompressor is covered by the GPL and I’m pretty sure it is completely unencumbered by patents.

Adam Kennedy | February 16, 2010 at 7:11 pm

Another suggestion for you to experiment with.

Tar the files sorted by file extension instead of file name (if you don’t already).

When we did this for the Strawberry Perl MSI installer, we saw a 10% reduction in file size, on the assumption that files of a similar file extension probably have similar content, and thus grouping them will feed the compression algorithm better.

Author comment by rusty | February 17, 2010 at 11:14 am

Interesting! OK, so I tested that here:

$ find linux-2.6 -name .git -prune -o -print | sed 's,\([./][^./]*$\),\1 \1,' | sort -k2 | cut -d\ -f1 > /tmp/sorted-list
$ tar cf /tmp/linux.2.6.sorted.tar -T /tmp/sorted-list --no-recursion
$ ls -l /tmp/linux.2.6.tar /tmp/linux.2.6.sorted.tar
-rw-r--r-- 1 rusty rusty 395048960 2010-02-17 10:06 /tmp/linux.2.6.sorted.tar
-rw-r--r-- 1 rusty rusty 395048960 2010-02-16 09:03 /tmp/linux.2.6.tar

OK, now the results for each compressor (-9 –to-stdout or -M for lrzip):

gzip /tmp/linux.2.6.tar: 86388655
gzip /tmp/linux.2.6.sorted.tar: 84904755
bzip2 /tmp/linux.2.6.tar: 67350827
bzip2 /tmp/linux.2.6.sorted.tar: 66784160
xz /tmp/linux.2.6.tar: 54614828
xz /tmp/linux.2.6.sorted.tar: 55100332
lrzip /tmp/linux.2.6.tar: 56500239
lrzip /tmp/linux.2.6.sorted.tar: 56104936

The xz result is flat-out weird; AFAICT any attempt to help xz with compression makes it worse. I wonder if it was actually optimized using the linux kernel source?

(PS. Added comment preview, since it was annoying me).

Con Kolivas | February 17, 2010 at 7:50 pm

Hi Rusty.

lrzip only becomes useful once the simple long distance dictionary rzip stage is significantly larger than the compression window of the backend (lzma). I doubt you’ll find the rzip first stage helps the kernel source at its current size when lzma’s compression windows are the size they already are. You’ll only start getting a benefit with a much larger codebase… which will be in not so long for linux kernel :P . If you see the benchmarks I posted, lrzip only has an advantage with massive files (and massive ram) when there is long distance redundancy. So if you start including multiple trees instead of a single kernel version, you’ll start seeing a benefit from the rzip stage. By a (not so strange) coincidence I also used kernel trees to test it.

As for bugs in rzip, I found it spat the dummy on greater than 32bit signed in multiple ways and suspect that’s why it had the artificial 900MB limit. Dealing with 32bit, mmap, write() limitations and so on has been a bit of a pain when trying to scale to 64bit sizes I have to admit. That’s been the bulk of the work that went into lrzip. In the process it actually made the compression a little less efficient and slower than rzip due to the 64bit offsets in the archives. xz’s performance on massive files was not so great, being both much slower than 7z (due to not being multithreaded) and not even compressing as well as it. I often wonder if the ground swell for xz on linux is well founded, but then there’s always more to software selection than merit on purely technical grounds. By that I don’t just mean politically…

Author comment by rusty | February 21, 2010 at 4:34 am

Using back-to-back kernel sources isn’t a fair test though. Perhaps try the 100M bz2 that is the sources?

The issue of 32-bit rzip and 900MB was simply the limitations of machines at the time Tridge was doing his thesis. 64 bit is the obvious thing to do these days, as well as much larger window. If I get cycles (ha!) I’d like to revisit it entirely: the table of backrefs and literal sizes should be compressed separately from the data itself, and rzip should never emit matches which are within the range of the backend. Also, LZMA2 should be used (that’s what xz uses I’m told).

For the current kernel sources, xz wins, making it a good choice for the mirrors. But an rzip variant should still be able to do at least as well, and better as we get larger.

Adrian | March 12, 2010 at 5:01 am

I have just installed and tried lrzip and I am impressed. It gives much better results than all other compressors for large files that are difficult to compress as they already use some form of compression. For example:

Uncompressed tar file with AVI files: 3206 GB
gzip -9: 3179 GB
bzip2 -9: 3178 GB
xz: 3179 GB
xz -9: 3175 GB
7z -mx=9: 3180 GB
lrzip -M: 2829 GB
lrzip -l: 2840 GB
lrzip -n: 2852 GB

The times were over 40 min. for bzip2, xz and 7z, 22 min. for lrzip -M, less than 5 min. for lrzip -l and less than 2 min. for lrzip -n.



Find it!

Theme Design by

Tag Cloud