Message-ID: <cburjqy3r73ojiaathpxwayvq7up263m3lvrikicrkkybdj2iz@vefohvamiqr4>
Date: Tue, 27 Jan 2026 15:28:53 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Harry Yoo <harry.yoo@...cle.com>, Petr Tesarik <ptesarik@...e.com>,
Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>,
Roman Gushchin <roman.gushchin@...ux.dev>, Hao Li <hao.li@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>, Uladzislau Rezki <urezki@...il.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Suren Baghdasaryan <surenb@...gle.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Alexei Starovoitov <ast@...nel.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev, bpf@...r.kernel.org,
kasan-dev@...glegroups.com
Subject: Re: [PATCH v4 18/22] slab: refill sheaves from all nodes

On Fri, Jan 23, 2026 at 07:52:56AM +0100, Vlastimil Babka wrote:
> __refill_objects() currently only attempts to get partial slabs from the
> local node and then allocates new slab(s). Expand it to trying also
> other nodes while observing the remote node defrag ratio, similarly to
> get_any_partial().
>
> This will prevent allocating new slabs on a node while other nodes have
> many free slabs. It does mean sheaves will contain non-local objects in
> that case. Allocations that care about specific node will still be
> served appropriately, but might get a slowpath allocation.

While I can agree that pulling memory from other nodes is necessary in some
cases, I believe the patch as proposed is way too aggressive and the commit
message does not justify it.

Interestingly there were already reports concerning this, for example:
https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com/T/#u
quoting:
* [vbabka:b4/sheaves-for-all-rebased] [slab] aa8fdb9e25: will-it-scale.per_process_ops 46.5% regression

The system at hand has merely 2 nodes and it already shows:
         %stddev     %change         %stddev
             \          |                \
      7274 ± 13%     -27.0%       5310 ± 16%  perf-c2c.DRAM.local
      1458 ± 14%    +272.3%       5431 ± 10%  perf-c2c.DRAM.remote
     77502 ±  9%     -58.6%      32066 ± 11%  perf-c2c.HITM.local
    150.83 ± 12%   +2150.3%       3394 ± 12%  perf-c2c.HITM.remote
     77653 ±  9%     -54.3%      35460 ± 10%  perf-c2c.HITM.total

As in, a significant increase in remote traffic. Things are bound to be much
worse on systems with 4 or more nodes.

This is not a microbenchmark-specific problem either -- any cache miss
on memory allocated like that induces interconnect traffic. That's a
real slowdown in real workloads.

Admittedly I don't know what the current policy is; it may be that things
already suck.
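
For reference, the gating in the existing get_any_partial() that the commit
message says is being mimicked is roughly the following (simplified sketch,
cpuset and mems_allowed retry handling elided -- see mm/slub.c for the real
thing):

	if (!s->remote_node_defrag_ratio ||
	    get_cycles() % 1024 > s->remote_node_defrag_ratio)
		return NULL;	/* stay node-local this time */

	zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(pc->flags)) {
		struct kmem_cache_node *n = get_node(s, zone_to_nid(zone));

		/* only steal from nodes with a surplus of partial slabs */
		if (n && n->nr_partial > s->min_partial) {
			slab = get_partial_node(s, n, pc);
			if (slab)
				return slab;
		}
	}
	return NULL;

So an off-node attempt is at least gated by a tunable probability and a
per-node nr_partial > s->min_partial threshold.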

A basic sanity test is this: suppose you have a process all of whose threads
are bound to one node. Absent memory shortage on the local node and
allocations which explicitly request a different node, is it going to get
local memory from kmalloc et al?

To my understanding, with the patch at hand the answer is no.
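
A quick way to eyeball it is a throwaway module along these lines (hypothetical
test code, not from the series; bind the insmod to one node, e.g.
"numactl --cpunodebind=0 insmod nodecheck.ko", and compare the two node ids it
prints):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/topology.h>

static int __init nodecheck_init(void)
{
	void *p = kmalloc(512, GFP_KERNEL);

	if (!p)
		return -ENOMEM;

	pr_info("nodecheck: task on node %d, object on node %d\n",
		numa_node_id(), page_to_nid(virt_to_page(p)));
	kfree(p);

	/* fail the load on purpose, nothing worth keeping around */
	return -EAGAIN;
}
module_init(nodecheck_init);

MODULE_LICENSE("GPL");

If I'm reading the refill path right, the second node id can now differ from
the first even when the local node has plenty of free memory, simply because
some other node happened to have partial slabs when the sheaf was refilled.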

Then not only is this particular process penalized for its lifetime, but
everything else is penalized on top -- even ignoring the straight up penalty
from the extra interconnect traffic, the interconnect can only handle so much
to begin with.

Readily usable slabs on other nodes should be of no significance as long as
there are enough resources locally.

If you are looking to reduce total memory usage, I would instead check how
things work out for reusing the same backing pages for differently sized
objects (I mean, is that even implemented?) and would investigate whether
additional kmalloc slab sizes would help -- there are power-of-2 jumps all
the way to 8k. Chances are decent that sizes like 384 and 768 bytes would
in fact drop the real memory requirement.
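
To put a number on the latter: a 320-byte allocation lands in kmalloc-512
today, wasting 192 bytes (37.5%) per object, while a kmalloc-384 cache would
cut that to 64 bytes (~17%); similarly a 600-byte allocation goes to
kmalloc-1024 and wastes 424 bytes (~41%), vs 168 bytes (~22%) with a
kmalloc-768 cache. Sizes picked purely for illustration.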

iow, I think this patch should be dropped, at least for the time being.