Message-ID: <aewj4cm6qojpm25qbn5pf75jg3xdd5zue2t4lvxtvgjbhoc3rx@b5u5pysccldy>
Date: Thu, 29 Jan 2026 11:44:21 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Hao Li <hao.li@...ux.dev>
Cc: Vlastimil Babka <vbabka@...e.cz>, Harry Yoo <harry.yoo@...cle.com>,
Petr Tesarik <ptesarik@...e.com>, Christoph Lameter <cl@...two.org>,
David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
Uladzislau Rezki <urezki@...il.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Alexei Starovoitov <ast@...nel.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
bpf@...r.kernel.org, kasan-dev@...glegroups.com,
kernel test robot <oliver.sang@...el.com>, stable@...r.kernel.org,
"Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
* Hao Li <hao.li@...ux.dev> [260129 11:07]:
> On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
> > On 1/29/26 16:18, Hao Li wrote:
> > > Hi Vlastimil,
> > >
> > > I conducted a detailed performance evaluation of each patch on my setup.
> >
> > Thanks! What was the benchmark(s) used?
Yes, thank you for running the benchmarks!
>
> I'm currently using the mmap2 test case from will-it-scale. The machine is still
> an AMD 2-socket system, with 2 nodes per socket, totaling 192 CPUs, with SMT
> disabled. For each test run, I used 64, 128, and 192 processes respectively.
What about the other tests you ran in the detailed evaluation? Were
there other regressions? It might be worth including the list of tests
that showed issues and some of the raw results (maybe at the end of
your email) to show more clearly what you saw. I did notice you had
done this previously.
Was the regression in the threaded or process version of mmap2?
>
> > Importantly, does it rely on vma/maple_node objects?
>
> Yes, this test primarily puts a lot of pressure on maple_node.
>
> > So previously those would become kind of double
> > cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
> > more than they should) since sheaves introduction in 6.18, and now they are
> > not double cached anymore?
>
> Exactly, since version 6.18, maple_node has indeed benefited from a dual-layer
> cache.
>
> I did wonder if this isn't a performance regression but rather the
> performance returning to its baseline after removing one layer of caching.
>
> However, verifying this idea would require completely disabling the sheaf
> mechanism on version 6.19-rc5 while leaving the rest of the SLUB code untouched.
> It would be great to hear any suggestions on how this might be approached.
You could use perf record to capture the differences between the two
kernels. You could also use perf to look at the differences between
three kernel versions:
1. pre-sheaves entirely
2. the 'dual layer' cache
3. the final version
In these scenarios it's not worth looking at the absolute numbers, just
the differences, since the debugging required to get meaningful
information makes the runs hugely slow and, potentially, not as
consistent. Sometimes I run them multiple times to ensure what I'm
seeing makes sense for a particular comparison (and the server didn't
just rotate the logs or whatever..)
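For example, something along these lines (the perf subcommands are
standard; how you launch mmap2 is whatever you are already doing, the
invocation below is just a placeholder for it):

	# on each kernel, capture call graphs for the same workload
	perf record -g -o perf.data.<kernel> -- <your mmap2 invocation>

	# compare two captures symbol by symbol
	perf diff perf.data.old perf.data

	# or just look at where the time goes on a given kernel
	perf report -i perf.data.<kernel>

That should make it fairly obvious whether the extra cycles are showing
up around the node list lock or somewhere else entirely.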
>
> >
> > > During my tests, I observed two points in the series where performance
> > > regressions occurred:
> > >
> > > Patch 10: I noticed a ~16% regression in my environment. My hypothesis is
> > > that with this patch, the allocation fast path bypasses the percpu partial
> > > list, leading to increased contention on the node list.
> >
> > That makes sense.
> >
> > > Patch 12: This patch seems to introduce an additional ~9.7% regression. I
> > > suspect this might be because the free path also loses buffering from the
> > > percpu partial list, further exacerbating node list contention.
> >
> > Hmm yeah... we did put the previously full slabs there, avoiding the lock.
> >
> > > These are the only two patches in the series where I observed noticeable
> > > regressions. The rest of the patches did not show significant performance
> > > changes in my tests.
> > >
> > > I hope these test results are helpful.
> >
> > They are, thanks. I'd however hope it's just some particular test that has
> > these regressions,
>
> Yes, I hope so too. And the mmap2 test case is indeed quite extreme.
>
> > which can be explained by the loss of double caching.
>
> If we could compare it with a version that only uses the
> CPU partial list, the answer might become clearer.
In my experience, micro-benchmarks are good at identifying specific
failure points of a patch set, but unless an entire area of benchmarks
regresses (i.e. all of the threaded mmap tests), they rarely tell the
whole story.
Are the benchmarks consistently slower? This specific test is sensitive
to alignment because of the 128MB mmap/munmap operation. Sometimes you
will see a huge spike at a particular process/thread count that moves
around in tests like this. Was your run consistently lower?
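A quick way to check is to repeat each process count a handful of times
and look at the spread, e.g. (the mmap2_processes binary name and the
-t flag are my assumption of how you're driving will-it-scale; adjust
to however you actually run it):

	# 5 runs at each process count; assumed invocation, adjust as needed
	for n in 64 128 192; do
		for i in 1 2 3 4 5; do
			./mmap2_processes -t $n
		done
	done

If one or two runs at a given count jump well above the rest, that's
more likely the spike moving around than a steady regression.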
Thanks,
Liam