Message-ID: <85d872a3-8192-4668-b5c4-c81ffadc74da@suse.cz>
Date: Tue, 27 Jan 2026 23:04:52 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Harry Yoo <harry.yoo@...cle.com>, Petr Tesarik <ptesarik@...e.com>,
Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>, Hao Li <hao.li@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
Uladzislau Rezki <urezki@...il.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Alexei Starovoitov <ast@...nel.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
bpf@...r.kernel.org, kasan-dev@...glegroups.com
Subject: Re: [PATCH v4 18/22] slab: refill sheaves from all nodes
On 1/27/26 15:28, Mateusz Guzik wrote:
> On Fri, Jan 23, 2026 at 07:52:56AM +0100, Vlastimil Babka wrote:
>> __refill_objects() currently only attempts to get partial slabs from the
>> local node and then allocates new slab(s). Expand it to trying also
>> other nodes while observing the remote node defrag ratio, similarly to
>> get_any_partial().
>>
>> This will prevent allocating new slabs on a node while other nodes have
>> many free slabs. It does mean sheaves will contain non-local objects in
>> that case. Allocations that care about specific node will still be
>> served appropriately, but might get a slowpath allocation.
>
> While I can agree pulling memory from other nodes is necessary in some
> cases, I believe the patch as proposed is way too aggressive and the
> commit message does not justify it.
OK, it's not elaborated on much, but "similarly to get_any_partial()" means
we try to behave the same way this was handled before sheaves, where the
very same decisions were used to obtain cpu (partial) slabs from a remote
node.
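For reference, the decision in question is roughly the gate at the top of
get_any_partial(). From memory, so this is a simplified sketch rather than
the exact upstream code, and may_try_remote_node() is just a name made up
for illustration:

static bool may_try_remote_node(struct kmem_cache *s)
{
	/*
	 * remote_node_defrag_ratio is set as a percentage via sysfs and
	 * stored scaled internally; a lower value means we go off-node
	 * less often, zero means never.
	 */
	if (!s->remote_node_defrag_ratio ||
	    get_cycles() % 1024 > s->remote_node_defrag_ratio)
		return false;

	return true;
}

Lowering the ratio thus biases the old code (and the sheaf refill that
mimics it) towards staying on the local node.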
The reason is that the bots can then hopefully compare before/after sheaves
based on the real differences between those caching approaches, and not on
subtle side-effects such as different NUMA tradeoffs.
But for bisecting performance regressions, it seems it was a mistake that I
did this part as a standalone patch and not immediately as part of patch 10
- I only split it out because patch 10 was already doing too much.
> Interestingly there were already reports concerning this, for example:
> https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com/T/#u
>
> quoting:
> * [vbabka:b4/sheaves-for-all-rebased] [slab] aa8fdb9e25: will-it-scale.per_process_ops 46.5% regression
And that's the problem, as it's showing before/after this commit only. But
it should also mean that patch 10 could have improved things by effectively
removing the remote NUMA refill aspect temporarily. Maybe it was too noisy
for a benefit report. It would be interesting to see the before/after for
the whole series.
> The system at hand has merely 2 nodes and it already got:
>
> %stddev %change %stddev
> \ | \
> 7274 ± 13% -27.0% 5310 ± 16% perf-c2c.DRAM.local
> 1458 ± 14% +272.3% 5431 ± 10% perf-c2c.DRAM.remote
> 77502 ± 9% -58.6% 32066 ± 11% perf-c2c.HITM.local
> 150.83 ± 12% +2150.3% 3394 ± 12% perf-c2c.HITM.remote
> 77653 ± 9% -54.3% 35460 ± 10% perf-c2c.HITM.total
>
> As in a significant increase in traffic.
However, I doubt the regression would be so severe if this was only about
"we allocated more remote objects so we are now accessing them more slowly".
But more on that later.
> Things have to be way worse on systems with 4 and more nodes.
>
> This is not a microbenchmark-specific problem either -- any cache miss
> on memory allocated like that induces interconnect traffic. That's a
> real slowdown in real workloads.
Sure, but that bad?
> Admittedly I don't know what the policy is at the moment, it may be
> things already suck.
As I was saying, it's basically the same as before sheaves, just via a
different caching mechanism.
BTW there's a tunable for this -
/sys/kernel/slab/xx/remote_node_defrag_ratio
> A basic test for sanity is this: suppose you have a process whose
> threads are all bound to one node. Absent memory shortage on the local
> node and allocations which somehow explicitly request a different node,
> is it going to get local memory from kmalloc et al?
All memory local? Not guaranteed.
> To my understanding with the patch at hand the answer is no.
Which is not a new thing.
> Then not only is this particular process penalized for its lifetime, but
> everything else is penalized on top -- even ignoring the straight-up
> penalty for interconnect traffic, there is only so much the interconnect
> can handle to begin with.
>
> Readily usable slabs in other nodes should be of no significance as long
> as there are enough resources locally.
Note that in general this approach can easily bite us in the end: by the
time there are no longer enough resources locally, it might be too late. Not
a completely fitting example, but see
https://lore.kernel.org/all/20251219-costly-noretry-thisnode-fix-v1-1-e1085a4a0c34@suse.cz/
> If you are looking to reduce total memory usage, I would instead check
> how things work out for reusing the same backing pages for differently
> sized objects (I mean, is it even implemented?) and would investigate if
This would be too complex and contrary to the basic slab design.
> additional kmalloc slab sizes would help -- there are power-of-2 jumps
> all the way to 8k. Chances are decent sizes like 384 and 768 bytes would
> in fact drop real memory requirement.
I don't think it's about trading off minimizing memory requirements
elsewhere to allow excessive per-node waste here. Sure, we can tune the
decisions here to only go for remote nodes when the number of partial slabs
there is more out of balance than what triggers it currently, etc. But we
should not eliminate the remote refill completely.
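Just to illustrate the kind of tuning I mean, here is a made-up sketch (not
something from the series; the helper name and the factor of two are
invented for the example) of a stricter check the remote loop in
__refill_objects_any() could use:

/* Hypothetical: only raid a remote node that is clearly better off. */
static bool remote_node_worth_refilling(struct kmem_cache *s,
					struct kmem_cache_node *remote,
					struct kmem_cache_node *local)
{
	/* always leave at least min_partial slabs on the remote node */
	if (remote->nr_partial <= s->min_partial)
		return false;

	/* and require a real imbalance, not just any free slabs there */
	return remote->nr_partial > 2 * local->nr_partial;
}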
> iow, I think this patch should be dropped at least for the time being
Because it's not introducing new behavior, I think it shouldn't.
However, I think I found a possible improvement that should not be a
tradeoff but a reasonable win, because I also noticed this in the profiles:
54.93 +17.5 72.46 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
And part of it is likely due to contention on the list_lock caused by the
remote refills. So we could make those trylock-only and see if it helps.
----8<----
>From 5ac96a0bde0c3ea5cecfb4e478e49c9f6deb9c19 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@...e.cz>
Date: Tue, 27 Jan 2026 22:40:26 +0100
Subject: [PATCH] slub: avoid list_lock contention from __refill_objects_any()
Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
---
mm/slub.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 7d7e1ae1922f..3458dfbab85d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3378,7 +3378,8 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
 
 static bool get_partial_node_bulk(struct kmem_cache *s,
 				  struct kmem_cache_node *n,
-				  struct partial_bulk_context *pc)
+				  struct partial_bulk_context *pc,
+				  bool allow_spin)
 {
 	struct slab *slab, *slab2;
 	unsigned int total_free = 0;
@@ -3390,7 +3391,10 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
 
 	INIT_LIST_HEAD(&pc->slabs);
 
-	spin_lock_irqsave(&n->list_lock, flags);
+	if (allow_spin)
+		spin_lock_irqsave(&n->list_lock, flags);
+	else if (!spin_trylock_irqsave(&n->list_lock, flags))
+		return false;
 
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		struct freelist_counters flc;
@@ -6544,7 +6548,8 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
 
 static unsigned int
 __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
-		      unsigned int max, struct kmem_cache_node *n)
+		      unsigned int max, struct kmem_cache_node *n,
+		      bool allow_spin)
 {
 	struct partial_bulk_context pc;
 	struct slab *slab, *slab2;
@@ -6556,7 +6561,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
 	pc.min_objects = min;
 	pc.max_objects = max;
 
-	if (!get_partial_node_bulk(s, n, &pc))
+	if (!get_partial_node_bulk(s, n, &pc, allow_spin))
 		return 0;
 
 	list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
@@ -6650,7 +6655,8 @@ __refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min
 		    n->nr_partial <= s->min_partial)
 			continue;
 
-		r = __refill_objects_node(s, p, gfp, min, max, n);
+		r = __refill_objects_node(s, p, gfp, min, max, n,
+					  /* allow_spin = */ false);
 		refilled += r;
 
 		if (r >= min) {
@@ -6691,7 +6697,8 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 		return 0;
 
 	refilled = __refill_objects_node(s, p, gfp, min, max,
-					 get_node(s, local_node));
+					 get_node(s, local_node),
+					 /* allow_spin = */ true);
 
 	if (refilled >= min)
 		return refilled;
--
2.52.0