Message-ID: <20250417141656.GC780688@cmpxchg.org>
Date: Thu, 17 Apr 2025 10:16:56 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Vitaly <vitaly.wool@...sulko.se>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, Nhat Pham <nphamcs@...il.com>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Igor Belousov <igor.b@...dev.am>
Subject: Re: [PATCH v4] mm: add zblock allocator

On Wed, Apr 16, 2025 at 10:10:23PM +0200, Vitaly wrote:
>
> On Wednesday, April 16, 2025 at 2:09:12 pm +02:00, Johannes Weiner <hannes@...xchg.org> wrote:
>
> >> zblock is also in most cases superior to zsmalloc with regard to
> >> average performance and worst execution times, thus allowing for better
> >> response time and real-time characteristics of the whole system.
>
> > Is there a reason not to use this allocation scheme in zsmalloc then?
>
> Introducing such a scheme in zsmalloc is theoretically possible but
> it appears to be more complicated than implementing it from scratch,
> which is exactly what was done.

Sure, but having two options adds sizable complexity as well.

> > I'm curious what others think, but I'm still not convinced a second
> > allocator makes sense. It's maintenance overhead, a permanent struggle
> > to match feature parity, and it fragments development and testing base.
>
> > Not long ago several slab allocators were removed for those
> > reasons. Likewise, we just deleted zbud and z3fold because they didn't
> > get any attention and bitrotted, but not before years of inflicting
> > pain through the zpool interface, users accidentally making very
> > suboptimal choices, reporting the same bugs over and over again etc.
>
> I'm not sure what pain you are talking about.

I list them in the paragraph you're replying to, and I've previously
outlined the burden on developers, maintainers, and admins to support
multiple implementations of the same functionality.

There is a real cost to doing this that you seem to dismiss.

> There were reasons why z3fold and zbud existed. z3fold and zbud were
> the ones that supported page reclaim, zsmalloc wasn't quite usable
> with zswap until recently. When we did z3fold it was outperforming
> zsmalloc.

We see compression ratios higher than 3 on a wide variety of
workloads, and z3fold by design never stores more than three objects
per page, so I'm a little doubtful z3fold ever outperformed zsmalloc
in general-purpose environments.
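
To put rough numbers on it (illustrative ratios, not a benchmark):
storing 1024 pages' worth of data at a 3.3x compression ratio takes
about 1024 / 3.3 ~= 311 backing pages for an allocator that can pack
to the ratio, while z3fold's three-objects-per-page cap needs at
least 1024 / 3 ~= 342 - before even counting its per-page headers.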

When Meta started using zswap, zsmalloc was certainly the only real
contender. zbud's storage density was barely worth the cost of
compression. z3fold was also not dense enough, and we ran into
stability issues and crashes. The page reclaim/writeback
implementation was not very useful either - take a look at the
history of changes from Nhat and Domenico. These weren't issues
specific to our use cases, but much more blatant "how was this ever
supposed to work?" problems.

There is nothing wrong with the evolution from zbud to more
sophisticated allocators. But there is a pretty mature and
feature-rich allocator now, and that sets a floor on what new
allocators need to support to be considered general-purpose.

> With that said, I didn't object to removing z3fold because I did
> understand that it made no sense to keep it at that point.

But you're proposing to do the same thing again, when multiple people
just got done phasing out and cleaning up your previous experiments.

> > I also don't buy the fragmentation argument. Even if you are better at
> > packing during allocation time (although, citation needed), the
> > workload can unmap swap pages such that they leave partial blocks
> > behind indefinitely if you don't have block compaction.
>
> We published Zswap/Zswapped values for zblock/zsmalloc after stress
> loads and those were on par, basically.

Those are accounted at the zswap level, so unfortunately they don't
capture backend fragmentation. You'd need to implement some of the
debugging features and memory counters that zsmalloc has in order to
compare the two.
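
To be concrete - a minimal sketch, with made-up zblock names, of the
kind of counters I mean (zsmalloc tracks the equivalents already):

	/* Hypothetical zblock-side stats, not existing code. */
	#include <linux/atomic.h>
	#include <linux/mm.h>	/* PAGE_SIZE */

	struct zblock_pool_stats {
		atomic_long_t pages_used;	/* backing pages held by the pool */
		atomic_long_t obj_bytes;	/* bytes handed out to zswap */
	};

	/* Bytes lost to partial blocks, invisible to Zswap/Zswapped. */
	static long zblock_wasted_bytes(struct zblock_pool_stats *s)
	{
		return atomic_long_read(&s->pages_used) * PAGE_SIZE -
		       atomic_long_read(&s->obj_bytes);
	}

Exporting something like that through debugfs is what would let us
compare fragmentation between the two allocators directly.
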
> > Then there is the migration support, which you said is planned, but
> > which would presumably require the same indirection between handle and
> > the backing pages that zsmalloc has. How much will this eat into the
> > performance advantage?
>
> I don't think that will be necessary. We're working on supporting
> __GFP_MOVABLE and minimising high order allocations.
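
For reference, movable backing pages are exactly why zsmalloc has the
indirection I asked about. Roughly - this is a simplification with
illustrative names, not the actual mm/zsmalloc.c code:

	/* A handle is a slot holding the object's current location. */
	typedef unsigned long zs_handle_t;

	static unsigned long handle_to_obj(zs_handle_t handle)
	{
		/* the extra dereference paid on every map/unmap */
		return *(unsigned long *)handle;
	}

	/* Migration moves the object, then just rewrites the slot. */
	static void obj_relocated(zs_handle_t handle, unsigned long new_obj)
	{
		*(unsigned long *)handle = new_obj;
	}

Without some equivalent, an object can't be found again once its
backing page has been migrated.
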
> > I'd much rather you'd focus on making zsmalloc better. Improve the
> > packing scheme, make expensive features optional/configurable etc.
> > That would be easier on developers and users alike.
>
> zblock's source code is almost 5x smaller in size than zsmalloc's

It's an apples-to-oranges comparison.

zsmalloc has memory pressure handling and a rich debugging
infrastructure that was added over time based on what people thought
necessary and useful from production experience.

Implement the same functionality in zblock and we can compare lines
and performance.

> and yet zblock works better in many cases with just a few
> bottlenecks. Why would you mind that we'd focus on making zblock
> better instead and possibly retire zsmalloc when that mission is
> accomplished, just like we retired z3fold a while ago?

You're proposing a significant, open-ended maintenance burden for
everybody else. I'm just asking for justification stronger than "the
small subset of the backend allocator's functionality that we
implemented is slightly faster in a limited number of benchmarks."

The fact that zstd - a very commonly used compressor - immediately
surfaced bugs that made zblock *much slower* is not reassuring.

The freelist issues that came up strongly suggest that you haven't
looked closely at zsmalloc and actually tried to find out why it does
things the way it does. That in turn suggests to me that this won't
be the only corner-case lesson zblock has to learn for things that
were addressed in zsmalloc a long time ago.

All new research is expected to address prior work in its space. A
new allocator should at least come with some analysis of where
exactly the current allocator is flawed and why fixing that would
likely amount to an entire rewrite anyway. Or it should already be a
realistic drop-in replacement for the existing allocator.

Making incremental improvements is the default. Forking needs a very
good reason.