Message-Id: <9A7AB7E7-4369-456D-80E9-A6B35227C373@konsulko.se>
Date: Fri, 18 Apr 2025 09:43:26 +0200
From: Vitaly Wool <vitaly.wool@...sulko.se>
To: Johannes Weiner <hannes@...xchg.org>
Cc: linux-mm@...ck.org,
akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org,
Nhat Pham <nphamcs@...il.com>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Igor Belousov <igor.b@...dev.am>
Subject: Re: [PATCH v4] mm: add zblock allocator
> On Apr 17, 2025, at 4:16 PM, Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Wed, Apr 16, 2025 at 10:10:23PM +0200, Vitaly wrote:
>>
>> On Wednesday, April 16, 2025 at 2:09:12 pm +02:00, Johannes Weiner <hannes@...xchg.org> wrote:
>>
>>>> zblock is also in most cases superior to zsmalloc with regard to
>>>> average performance and worst execution times, thus allowing for better
>>>> response time and real-time characteristics of the whole system.
>>
>>> Is there a reason not to use this allocation scheme in zsmalloc then?
>>
>> Introducing such a scheme in zsmalloc is theoretically possible but
>> it appears to be more complicated than implementing it from scratch,
>> which is exactly what was done.
>
> Sure, but having two options adds sizable complexity as well.
>
>>> I'm curious what others think, but I'm still not convinced a second
>>> allocator makes sense. It's maintenance overhead, a permanent struggle
>>> to match feature parity, and it fragments development and testing base.
>>
>>> Not long ago several slab allocators were removed for those
>>> reasons. Likewise, we just deleted zbud and z3fold because they didn't
>>> get any attention and bitrotted, but not before years of inflicting
>>> pain through the zpool interface, users accidentally making very
>>> suboptimal choices, reporting the same bugs over and over again etc.
>>
>> I'm not sure what pain you are talking about.
>
> I list them in the paragraph you're replying to, and I've previously
> outlined the burden on developers, maintainers, and admins to support
> multiple implementations of the same functionality.
>
> There is a real cost to doing this that you seem to dismiss.
I don't dismiss the cost of maintenance; I just don't buy the "years of pain" point because it's very subjective. There are plenty of people who still think that using Linux itself is a pain. OTOH, back in 2016-2017 mobile devices using z3fold had better response metrics than those using zsmalloc.
>
>> There were reasons why z3fold and zbud existed. z3fold and zbud were
>> the ones that supported page reclaim, zsmalloc wasn't quite usable
>> with zswap until recently. When we did z3fold it was outperforming
>> zsmalloc.
>
> We see a higher compression ratio than 3 on a wide variety of
> workloads, so I'm a little doubtful z3fold ever outperformed zsmalloc
> in general-purpose environments.
z3fold never outperformed zsmalloc in terms of allocation density, but before zsmalloc got a *working* compaction implementation the two were on par.
Besides, compression ratios higher than 3 were pretty rare before zstd was introduced, and if you had ever tried zstd on msm8009, which was the SoC in many mobile devices back then, you would understand how pointless it was for most of the embedded world at the time.
And at that time z3fold was indeed faster than zsmalloc on multi-core systems, especially on big.LITTLE.
>
> When Meta started using zswap, certainly zsmalloc was the only real
> contender. zbud's storage density was almost not worth the cost of
> compression. z3fold was also not dense enough, and we ran into
> stability issues and crashes. The page reclaim/writeback
> implementation was not very useful either - take a look at the history
> of changes from Nhat and Domenico. These weren't just issues specific
> to our usecases, but much more blatant "how was this ever supposed to
> work?" problems.
>
> There is nothing wrong with the evolution from zbud to more
> sophisticated allocators. But there is a pretty mature and
> feature-rich allocator now, and that sets a floor on what new
> allocators need to support to be considered general-purpose.
>
>> With that said, I didn't object to removing z3fold because I did
>> understand that it made no sense to keep it at that point.
>
> But you're proposing to do the same thing again, when multiple people
> just got done phasing out and cleaning up your previous experiments.
Oh well, the conversation is taking a twist. Do you mean that any new submission is "the same thing again", or what exactly do you mean here?
>
>>> I also don't buy the fragmentation argument. Even if you are better at
>>> packing during allocation time (although, citation needed), the
>>> workload can unmap swap pages such that they leave partial blocks
>>> behind indefinitely if you don't have block compaction.
>>
>> We published Zswap/Zswapped values for zblock/zsmalloc after stress
>> loads and those were on par, basically.
>
> Those are accounted in zswap, so unfortunately don't capture backend
> fragmentation. You'd need to implement some of the debugging features
> and memory counters that zsmalloc has in order to compare them.
zblock reports total_pages in an honest way, i.e. the number of pages it has allocated regardless of how full or empty they are, and the numbers clearly show there is no substantial internal fragmentation for any real-life workload we could come up with. You can of course allocate a bunch of slots, free every second one and then do nothing else at all, and the fragmentation will be large, but that is not how zswap operates.
FWIW, it is also possible to make zsmalloc constantly alternate between compaction and fragmentation, but that would not mimic zram or zswap operation either.
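To make the free-every-second-slot point concrete, here is a toy Python model (not the real zblock code; SLOTS_PER_BLOCK is a made-up number) of slot-based blocks whose backing pages can only be returned once every slot in the block is free:

```python
# Toy model: blocks hold SLOTS_PER_BLOCK equal-size slots; a block's
# pages stay pinned until every slot in it has been freed.

SLOTS_PER_BLOCK = 32  # hypothetical slot count for one size class

def fragmentation(total_allocs, freed_indices):
    """Return (live slots, blocks still pinned) after the given frees."""
    nblocks = (total_allocs + SLOTS_PER_BLOCK - 1) // SLOTS_PER_BLOCK
    blocks = [0] * nblocks
    for i in range(total_allocs):          # fill blocks in order
        blocks[i // SLOTS_PER_BLOCK] += 1
    for i in freed_indices:                # free the chosen slots
        blocks[i // SLOTS_PER_BLOCK] -= 1
    live = sum(blocks)
    held = sum(1 for b in blocks if b > 0)  # blocks that cannot be released
    return live, held

# zswap-like churn: frees cluster together, so whole blocks drain
print(fragmentation(1024, range(0, 512)))      # (512, 16) - no waste

# pathological pattern from the text: free every second slot
print(fragmentation(1024, range(0, 1024, 2)))  # (512, 32) - all blocks pinned
```

Half the slots are live in both cases, but in the adversarial pattern every block stays pinned, doubling the pages held for the same live data; under real zswap churn the frees cluster and blocks drain whole.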
>>> Then there is the migration support, which you said is planned, but
>>> which would presumably require the same indirection between handle and
>>> the backing pages that zsmalloc has. How much will this eat into the
>>> performance advantage?
>>
>> I don't think that will be necessary. We're working on supporting
>> GFP_MOVABLE and minimising high order allocations
>
>>> I'd much rather you'd focus on making zsmalloc better. Improve the
>>> packing scheme, make expensive features optional/configurable etc.
>>> That would be easier on developers and users alike.
>>
>> zblock's source code is almost 5x smaller in size than zsmalloc's
>
> It's an apple-to-oranges comparison.
>
> zsmalloc has memory pressure handling and a rich debugging
> infrastructure that was added over time based on what people thought
> necessary and useful from production experience.
>
> Implement the same functionality in zblock and we can compare lines
> and performance.
You assume that e.g. the memory pressure handling implemented in zsmalloc is necessary for zblock, which I don't think is the case.
Debug-facilitating additions are good, but I _really_ _doubt_ they will bloat zblock's code by 5x.
>
>> and yet zblock works better in many cases with just a few
>> bottlenecks. Why would you mind that we'd focus on making zblock
>> better instead and possibly retire zsmalloc when that mission is
>> accomplished, just like we retired z3fold a while ago?
>
> You're proposing a significant, open-ended maintenance burden for
> everybody else. I'm just asking for some justification stronger than
> "the small subset of the backend allocator that we implemented is
> slightly faster in a limited number of benchmarks."
Obviously we will be maintaining and extending the code; we have already discussed that.
>
> The fact that zstd - a very commonly used compressor - immediately
> surfaced bugs that made it *much slower* is not reassuring.
I would expect someone stating that to first ask what the problem actually was. :)
~Vitaly