Date:	Sun, 7 Nov 2010 22:59:25 +1100
From:	Con Kolivas <kernel@...ivas.org>
To:	linux-kernel@...r.kernel.org
Subject: Linux kernel compression with lrzip

Hi all

Let's do this backwards. Data first.


tarball of every 3 point linux kernel from 2.6.0 to 2.6.36

9748285440 linux-2.6.0-2.6.36.tar


compressed file

136516013 linux-2.6.0-2.6.36.tar.lrz


Compressed size: 1.4% of the original


Compression time:

00:19:22.086


Decompression time:

00:03:02.564


http://lrzip.kolivas.org


Now for the introduction

Lrzip is a compression application I've been working on, based on Andrew 
Tridgell's excellent rzip. Rzip works by running a preprocessor stage with a 
massive compression window (up to 900MB) to compact long distance redundancy, 
then compressing the result with bz2. Lrzip began as an attempt to extend 
that initial compression window beyond 900MB and to use lzma as the back end 
for the 2nd stage compression. The idea was that as files got bigger and 
machines had more ram, it would keep getting more and more useful.
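
For the curious, here's a rough standalone C sketch of the first stage idea. 
This is NOT the lrzip code; the chunk size and hash table size are made-up 
illustration values, and a real implementation uses a rolling hash and 
verifies matches byte for byte. The point is just that remembering where a 
hash of each chunk was last seen lets you spot a repeat even when the two 
copies are hundreds of megabytes apart, which is exactly what a conventional 
small-window compressor cannot do:

/*
 * Toy long-distance repeat finder (illustration only, not rzip/lrzip).
 * Hash fixed-size chunks and remember the last position each hash was
 * seen at; any chunk whose hash has appeared before is a candidate
 * repeat, no matter how far back the earlier copy is.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define CHUNK      64                  /* bytes hashed per probe (assumed) */
#define TABLE_BITS 20
#define TABLE_SIZE (1u << TABLE_BITS)

static uint64_t chunk_hash(const unsigned char *p)
{
	/* FNV-1a over one chunk; any reasonable hash would do */
	uint64_t h = 1469598103934665603ULL;
	for (int i = 0; i < CHUNK; i++) {
		h ^= p[i];
		h *= 1099511628211ULL;
	}
	return h;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	FILE *f = fopen(argv[1], "rb");
	if (!f) { perror("fopen"); return 1; }

	/* last position (+1, so 0 means unseen) for each hash bucket */
	size_t *seen = calloc(TABLE_SIZE, sizeof(size_t));
	if (!seen) { perror("calloc"); return 1; }

	unsigned char buf[CHUNK];
	size_t pos = 0, candidates = 0;

	while (fread(buf, 1, CHUNK, f) == CHUNK) {
		uint64_t h = chunk_hash(buf) & (TABLE_SIZE - 1);
		if (seen[h])
			candidates++;  /* a real tool would verify the bytes and emit a match */
		seen[h] = pos + 1;
		pos += CHUNK;
	}
	printf("%zu candidate repeated chunks\n", candidates);
	free(seen);
	fclose(f);
	return 0;
}

Roughly speaking, rzip proper turns such matches into offset/length records 
and hands whatever is left over to the back end compressor, which is where 
bz2 (or lzma in lrzip) comes in.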

After many iterations on lrzip, I've been able to significantly expand on this 
idea and make it handle 64 bit file sizes, ram sizes, and bigger windows. 
Previously the limit was set by available ram and how much of the file being 
compressed could be mmapped. The current version (0.5.2) can use unlimited 
sized compression windows for the preprocessor stage, using two sliding mmap 
windows I invented to match the hash search algorithm of rzip, so the search 
can go beyond the size of the ram available. Basically, the larger the file 
and the more redundant information in it, the more compression you can 
achieve.
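
And for anyone wondering what the sliding mmap actually does, here's a 
minimal standalone C sketch. Again this is not the lrzip code (the real thing 
is in rzip.c, and the 64MB window below is an arbitrary illustration value); 
lrzip pairs two such windows so the rzip hash search can range over a file 
far larger than ram. The toy below just checksums a file of any size while 
never keeping more than one window of it mapped:

/*
 * Toy sliding-mmap walk (illustration only, not lrzip's implementation).
 * Map a fixed-size window of the file, process it, unmap it, then map
 * the next window, so memory residency stays bounded no matter how big
 * the file is.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define WINDOW (64UL * 1024 * 1024)    /* 64MB mapped at a time (assumed) */

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

	unsigned long long sum = 0;

	for (off_t base = 0; base < st.st_size; base += WINDOW) {
		/* WINDOW is a multiple of the page size, so base stays aligned */
		size_t len = (st.st_size - base < (off_t)WINDOW) ?
			     (size_t)(st.st_size - base) : (size_t)WINDOW;
		unsigned char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, base);
		if (map == MAP_FAILED) { perror("mmap"); return 1; }

		for (size_t i = 0; i < len; i++)
			sum += map[i];         /* stand-in for the real hash search */

		munmap(map, len);              /* slide: drop this window, map the next */
	}
	printf("checksum %llu over %lld bytes, at most %lu bytes mapped at once\n",
	       sum, (long long)st.st_size, WINDOW);
	close(fd);
	return 0;
}

(On a 32 bit build you'd want -D_FILE_OFFSET_BITS=64 for files this large.)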

Anyway the relevance of this to kernel folk is that the linux kernel is a 
perfect test case for long distance redundancy when multiple kernel trees are 
compressed.

So here's the lowdown. All of these tests were run on a 3GHz quad core 64 bit 
machine with 8GB ram and a 160GB SSD, so hard drive speed is a (relatively) 
insignificant part of the times.


Starting with linux kernel 2.6.36

413511680 linux-2.6.36.tar
70277083 linux-2.6.36.tar.bz2
59905365 linux-2.6.36.tar.lrz

Compression:
Compression Ratio: 6.903. Average Compression Speed:  1.582MB/s.           
Total time: 00:04:08.896

Decompression:
Average DeCompression Speed: 17.130MB/s
Total time: 00:00:22.557


So trying my best to show off what lrzip is good at, I grabbed every 3 point 
2.6 linux kernel release. That's every release from 2.6.0 to 2.6.36 as a 
tarball, not as patches. It came to a 9.1GB file. Now each kernel tree is 
going to be repeating an -awful- lot of data, but given that they have gotten 
progressively larger, and span gigabytes, normal compression doesn't really 
offer much. This is where the rzip compaction stage over the full size of the 
file comes in handy. The more redundant information there is over large 
distances, the better. [insert joke about the linux kernel and redundant 
information here]. (-MU means Maximum Unlimited; maximum ram and unlimited 
window size).


9748285440 linux-2.6.0-2.6.36.tar

lrzip -MU linux-2.6.0-2.6.36.tar

136516013 linux-2.6.0-2.6.36.tar.lrz

Compression:
Compression Ratio: 71.408. Average Compression Speed:  8.000MB/s.
Total time: 00:19:22.086

That's 9.1GB to 130MB (1.4%) in 20 minutes.

Decompression:
Average DeCompression Speed: 51.359MB/s
Total time: 00:03:02.564


At 130MB, it's small enough for me to offer the entire 3 point stable release 
series for download from my piddly little server. So here's the file, if 
anyone wants to confirm its validity, or just wants to download them all :)

http://ck.kolivas.org/linux-2.6.0-2.6.36.tar.lrz


lrzip also has lzo and zpaq back ends: zpaq for those who want to extract 
every last drop of compression out of it at the cost of more time, and lzo 
for those who just want speed. zpaq is EIGHT TIMES slower during the back end 
compression phase compared to lzma, whereas lzo is almost instant after the 
rzip phase. Note that this file loses most of its size during the 
preprocessor stage, so it doesn't end up taking 8x longer to compress with 
zpaq:


lrzip -MUz linux-2.6.0-2.6.36.tar

121021214 linux-2.6.0-2.6.36.tar.lrz

Compression:
Compression Ratio: 80.550. Average Compression Speed:  3.041MB/s.
Total time: 00:50:56.537

That's 9.1GB to 115MB (1.2%) in 51 mins.

Decompression time is about 52 minutes (yes, it's longer than compression!)


Now, I am NOT proposing lrzip be used as a drop-in replacement for the 
existing compression formats in use. It does work on stdin/stdout, but 
inefficiently, except for compression from stdin. It has no archive 
facilities beyond compressing a single file, so it comes with lrztar and 
lrzuntar wrappers to do basic directory archiving. It uses a LOT of ram, and 
although it can use an unlimited sized compression window unconstrained by 
ram, the bigger the discrepancy between the file size and the ram, the slower 
it will get. (Decompression typically uses 10x less ram than compression.)

Nevertheless, I hope some people with lots of similar files lying around, 
like kernel hackers, will find it a useful tool, as will people with database 
dumps and the like. I also wonder if features of this compression approach 
might be helpful for other projects that deal with huge amounts of data and 
have large memory requirements. When I mentioned the sliding mmap feature 
online, git's heavy memory use during its compression phase was brought up, 
so perhaps there are other uses for the sliding mmap idea as a way of 
reducing memory usage. There's some documentation of it in the code in 
rzip.c, and I wrote briefly about it on my blog: http://ck-hack.blogspot.com


Enjoy!
-- 
-ck
