linux-kernel - Re: [PATCH v4] mm: add zblock allocator

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aBoI39N_mHx0Tlp-@google.com>
Date: Tue, 6 May 2025 13:04:31 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Vitaly Wool <vitaly.wool@...sulko.se>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org, Nhat Pham <nphamcs@...il.com>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Johannes Weiner <hannes@...xchg.org>,
	Igor Belousov <igor.b@...dev.am>, Minchan Kim <minchan@...nel.org>,
	Sergey Senozhatsky <senozhatsky@...omium.org>
Subject: Re: [PATCH v4] mm: add zblock allocator

On Thu, May 01, 2025 at 02:41:29PM +0200, Vitaly Wool wrote:
> Hi Yosry,
> 
> On 4/30/25 14:27, Yosry Ahmed wrote:
> > On Wed, Apr 23, 2025 at 09:53:48PM +0200, Vitaly Wool wrote:
> > > On 4/22/25 12:46, Yosry Ahmed wrote:
> > > > I didn't look too closely but I generally agree that we should improve
> > > > zsmalloc where possible rather than add a new allocator. We are trying
> > > > not to repeat the zbud/z3fold or slub/slob stories here. Zsmalloc is
> > > > getting a lot of mileage from both zswap and zram, and is more-or-less
> > > > battle-tested. Let's work toward building upon that instead of starting
> > > > over.
> > > 
> > > The thing here is, zblock is using a very different approach to small object
> > > allocation. The idea is: we have an array of descriptors which correspond to
> > > multi-page blocks divided in chunks of equal size (block_size[i]). For each
> > > object of size x we find the descriptor n such as:
> > > 	block_size[n-1] < n < block_size[n]
> > > and then we store that object in an empty slot in one of the blocks. Thus,
> > > the density is high, the search is fast (rbtree based) and there are no
> > > objects spanning over 2 pages, so no extra memcpy involved.
> > 
> > The block sizes seem to be similar in principle to class sizes in
> > zsmalloc. It seems to me that there are two apparent differentiating
> > properties to zblock:
> > 
> > - Block lookup uses an rbtree, so it's faster than zsmalloc's list
> >    iteration. On the other hand, zsmalloc divides each class into
> >    fullness groups and tries to pack almost full groups first. Not sure
> >    if zblock's approach is strictly better.
> 
> If we free a slot in a fully packed block we put it on top of the list.
> zswap's normal operation pattern is that there will be more free slots in
> that block so it's roughly the same.

How so? IIUC the order in which slots are freed depends on the LRU (for
writeback) and swapins (for loads). Why do we expect that slots from the
same block will be freed in close succession?

> 
> > - Zblock uses higher order allocations vs. zsmalloc always using order-0
> >    allocations. I think this may be the main advantage and I remember
> >    asking if zsmalloc can support this. Always using order-0 pages is
> >    more reliable but may not always be the best choice.
> 
> There's a patch we'll be posting soon with "opportunistic" high order
> allocations (i. e. if try_alloc_pages fails, allocate order-0 pages
> instead). This will leverage the benefits of higher order allocations
> without putting too much stress on the system.
> 
> > On the other hand, zblock is lacking in other regards. For example:
> > - The lack of compaction means that certain workloads will see a lot of
> >    fragmentation. It purely depends on the access patterns. We could end
> >    up with a lot of blocks each containing a single object and there is
> >    no way to recover AFAICT.
> 
> We have been giving many variants of stress load on the memory subsystem and
> the worst compression ratio *after* the stress load was 2.8x using zstd as
> the compressor (and about 4x under load). With zsmalloc under the same
> conditions the ratio was 3.6x after and 4x under load.
> 
> With more normal (but still stressing) usage patterns the numbers *after*
> the stress load were around 3.8x and 4.1x, respectively.
> 
> Bottom line, ending up with a lot of blocks each containing a single object
> is not a real life scenario. With that said, we have a quite simple solution
> in the making that will get zblock on par with zsmalloc even in the cases
> described above.

Could you share a high-level description of how this issue will be
addressed in zblock? I am trying to understand why/how zblock can handle
this better/simpler than zsmalloc.

> 
> > - Zblock will fail if a high order allocation cannot be satisfied, which
> >    is more likely to happen under memory pressure, and it's usually when
> >    zblock is needed in the first place.
> 
> See above, this issue will be addressed in the patch coming in a really
> short while.
> 
> > - There's probably more, I didn't check too closely, and I am hoping
> >    that Minchan and Sergey will chime in here.
> > 
> > > 
> > > And with the latest zblock, we see that it has a clear advantage in
> > > performance over zsmalloc, retaining roughly the same allocation density for
> > > 4K pages and scoring better on 16K pages. E. g. on a kernel compilation:
> > > 
> > > * zsmalloc/zstd/make -j32 bzImage
> > > 	real	8m0.594s
> > > 	user	39m37.783s
> > > 	sys	8m24.262s
> > > 	Zswap:            200600 kB <-- after build completion
> > > 	Zswapped:         854072 kB <-- after build completion
> > > 	zswpin 309774
> > > 	zswpout 1538332
> > > 
> > > * zblock/zstd/make -j32 bzImage
> > > 	real	7m35.546s
> > > 	user	38m03.475s
> > > 	sys	7m47.407s
> > > 	Zswap:            250940 kB <-- after build completion
> > > 	Zswapped:         870660 kB <-- after build completion
> > > 	zswpin 248606
> > > 	zswpout 1277319
> > > 
> > > So what we see here is that zblock is definitely faster and at least not
> > > worse with regard to allocation density under heavy load. It has slightly
> > > worse _idle_ allocation density but since it will quickly catch up under
> > > load it is not really important. What is important is that its
> > > characteristics don't deteriorate over time. Overall, zblock is simple and
> > > efficient and there is /raison d'etre/ for it.
> > 
> > Zblock is performing better for this specific workload, but as I
> > mentioned earlier there are other aspects that zblock is missing.
> > Zsmalloc has seen a very large range of workloads of different types,
> > and we cannot just dismiss this.
> 
> We've been running many different work loads with both allocators but
> posting all the results in the patch description will go well beyond the
> purpose of a patch submission. If there are some workloads you are
> interested in in particular, please let me know, odds are high we have some
> results for those too.

That's good to know. I don't have specific workloads in mind, was just
stating the fact that zsmalloc has been tested with a variety of
workloads in production environments.

> 
> > > Now, it is indeed possible to partially rework zsmalloc using zblock's
> > > algorithm but this will be a rather substantial change, equal or bigger in
> > > effort to implementing the approach described above from scratch (and this
> > > is what we did), and with such drastic changes most of the testing that has
> > > been done with zsmalloc would be invalidated, and we'll be out in the wild
> > > anyway. So even though I see your point, I don't think it applies in this
> > > particular case.
> > 
> > 
> > Well, we should start by breaking down the differences and finding out
> > why zblock is performing better, as I mentioned above. If it's the
> > faster lookups or higher order allocations, we can work to support that
> > in zsmalloc. Similarly, if zsmalloc has unnecessary complexity it'd be
> > great to get rid of it rather than starting over.
> > 
> > Also, we don't have to do it all at once and invalidate the testing that
> > zsmalloc has seen. These can be incremental changes that get spread over
> > multiple releases, getting incremental exposure in the process.
> 
> I believe we are a lot closer now to having a zblock without the initial
> drawbacks you have pointed out than a faster zsmalloc, retaining the code
> simplicity of the former.

This does not answer my question tho. I am trying to understand what
makes zblock faster than zsmalloc.

If it's the (optionally opportunistic) higher order allocations, we can
probably do the same in zsmalloc and avoid memcpys as well. I guess we
can try to allocate zspages as a high-order contiguous allocation first,
and fallback to the current "chain" structure on failure.

If it's the lack of overhead from compaction (e.g. less locking), then
the question should be why does zsmalloc need compaction and zblock
allegedly doesn't?

I am not arguing that you're seeing better results from zblock, what I
am saying is let's pinpoint what makes zblock better first, then decide
whether the same can be applied to zsmalloc or if it's really better to
create a new allocator instead. Right now we don't have full
information.

WDYT?