Message-ID: <ed517e4a-70db-4e80-9fbe-b1a4cfe3a11c@konsulko.se>
Date: Wed, 9 Apr 2025 01:12:38 +0200
From: Vitaly Wool <vitaly.wool@...sulko.se>
To: Nhat Pham <nphamcs@...il.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Igor Belousov <igor.b@...dev.am>,
 linux-mm@...ck.org, akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
 Shakeel Butt <shakeel.butt@...ux.dev>, Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH v2] mm: add zblock allocator


>>> So zstd results in nearly double the compression ratio, which in turn
>>> cuts total execution time *almost in half*.
>>>
>>> The numbers speak for themselves. Compression efficiency >>> allocator
>>> speed, because compression efficiency ultimately drives the continuous
>>> *rate* at which allocations need to occur. You're trying to optimize a
>>> constant coefficient at the expense of a higher-order one, which is a
>>> losing proposition.
>>
>> Well, not really. This is an isolated use case with
>> a. significant computing power under the hood
>> b. relatively few cores
>> c. relatively short test
>> d. 4K pages
>>
>> If any of these isn't true, zblock dominates.
>> !a => zstd is too slow
>> !b => parallelization gives more effect
>> !c => zsmalloc starts losing due to having to deal with internal
>> fragmentation
>> !d => compression efficiency of zblock is better.
>>
>> Even !d alone makes zblock a better choice for ARM64 based servers.
>>
>> ~Vitaly
> 
> Could you expand on each point? And do you have data to show this?
> 
> For b, we run zswap + zsmalloc on hosts with hundreds of cores, and
> have not found zsmalloc to be a noticeable bottleneck yet, FWIW.

I don't have the numbers at hand; I think Igor will be able to provide 
those tomorrow.

> For c - in longer runs, how does zblock perform better than zsmalloc?
> In fact, my understanding is that zsmalloc does compaction, which
> should help with internal fragmentation over time. zblock doesn't seem
> to do this, or maybe I missed it?

The thing is, zblock doesn't have to. Imagine a street with cars parked 
along the side. If cars of different lengths keep driving in and out, 
you end up with gaps in between that longer cars can't squeeze into. 
This is why zsmalloc does compaction.

Now for zblock, only cars of the same length are allowed to park on a 
given street, so each street is either full or has a free spot that 
fits.
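
To make the analogy concrete, here is a minimal userspace sketch of the 
"one slot size per block" idea (not the actual zblock structures; the 
toy_block/toy_alloc/toy_free names and the 32-slot bitmap are made up 
for illustration). Because every slot in a block has the same size, a 
freed slot is always a perfect fit for the next allocation of that 
class, so there are no variable-sized holes to compact:

#include <stddef.h>
#include <strings.h>	/* ffs() */

struct toy_block {
	size_t slot_size;	/* every slot in this block has this size */
	unsigned int free_map;	/* one bit per slot, 1 = free */
	char *mem;		/* backing storage, 32 * slot_size bytes */
};

static void *toy_alloc(struct toy_block *b)
{
	int slot = ffs((int)b->free_map);	/* any free slot will do */

	if (!slot)
		return NULL;	/* this block is full, pick another block */
	b->free_map &= ~(1u << (slot - 1));
	return b->mem + (size_t)(slot - 1) * b->slot_size;
}

static void toy_free(struct toy_block *b, void *p)
{
	size_t slot = (size_t)((char *)p - b->mem) / b->slot_size;

	b->free_map |= 1u << slot;	/* reusable as-is, nothing to compact */
}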

> For d too. I see that you hard code special configurations for zblock
> blocks in the case of 0x4000 page size, but how does that help with
> compression efficiency?

Well, to answer that properly I need to dig deeper into zsmalloc's 
operation, but I would guess that zsmalloc's chunks are simply 
multiplied by 4 in the 16K page case, so you lose all the granularity 
you used to have. I'm not completely certain, though.
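
A quick back-of-the-envelope sketch of that guess (the ">> 8" is my 
assumption about how the size-class step is derived from the page size, 
not something I have verified against the zsmalloc code):

#include <stdio.h>

/* assumed: size-class step = PAGE_SIZE >> 8; please double-check */
static unsigned int class_step(unsigned int page_size)
{
	return page_size >> 8;
}

int main(void)
{
	/* 4K pages give 16-byte steps, 16K pages give 64-byte steps,
	 * so each compressed object can waste up to 4x more padding. */
	printf("4K pages:  %u-byte class step\n", class_step(4096));
	printf("16K pages: %u-byte class step\n", class_step(16384));
	return 0;
}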

Meanwhile I did a quick measurement run with zblock and zsmalloc on a 
Raspberry Pi 5 (native kernel build test) with zstd as the compression 
backend, and the results are as follows:

1. zsmalloc
*** The build was OOM killed ***
real    26m58.876s
user    95m32.425s
sys     4m39.017s
Zswap:            250944 kB
Zswapped:         871536 kB
zswpin 108
zswpout 54473
663296  /mnt/tmp/build/

2. zblock
real    27m31.579s
user    96m42.845s
sys     4m40.464s
Zswap:             66592 kB
Zswapped:         563168 kB
zswpin 243
zswpout 35262
1423200 /mnt/tmp/build/

You can see from the size of the build folder that the first run was 
terminated prematurely, nowhere near the end of the build.

So, I can re-run the tests on an 8-core high-performance ARM64 machine 
with 16K pages tomorrow, but so far everything we have seen points in 
one direction: zblock is clearly superior to zsmalloc in a 16K page 
configuration.

Besides, zblock can do even better if we extend the hardcoded table you 
mentioned (and BTW, it could be generated automatically at init, but I 
don't see the point in that).
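
For reference, generating such a table at init could look roughly like 
the sketch below. This is purely illustrative: the 32-byte slot-size 
step, the 1..4 page search range and the "least waste" heuristic are 
all made up here, not how the real zblock table was tuned.

#include <stdio.h>

#define PAGE_SIZE 4096u		/* assumption for the example */

int main(void)
{
	unsigned int slot, pages;

	/* hypothetical: one table entry per 32-byte slot-size step */
	for (slot = 32; slot <= PAGE_SIZE; slot += 32) {
		unsigned int best_pages = 1, best_waste = PAGE_SIZE;

		/* try block sizes of 1..4 pages, keep the least wasteful */
		for (pages = 1; pages <= 4; pages++) {
			unsigned int waste = (pages * PAGE_SIZE) % slot;

			if (waste < best_waste) {
				best_waste = waste;
				best_pages = pages;
			}
		}
		printf("slot %4u: %u page(s), %u bytes wasted per block\n",
		       slot, best_pages, best_waste);
	}
	return 0;
}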

~Vitaly
