Message-ID: <20251001164131.GB1597553@cmpxchg.org>
Date: Wed, 1 Oct 2025 12:41:31 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>, Zi Yan <ziy@...dia.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Gregory Price <gourry@...rry.net>,
Joshua Hahn <joshua.hahnjy@...il.com>
Subject: Re: [PATCH] mm: page_alloc: avoid kswapd thrashing due to NUMA
restrictions

On Wed, Oct 01, 2025 at 04:59:02PM +0200, Vlastimil Babka wrote:
> On 9/19/25 6:21 PM, Johannes Weiner wrote:
> > On NUMA systems without bindings, allocations check all nodes for free
> > space, then wake up the kswapds on all nodes and retry. This ensures
> > all available space is evenly used before reclaim begins. However,
> > when one process or certain allocations have node restrictions, they
> > can cause kswapds on only a subset of nodes to be woken up.
> >
> > Since kswapd hysteresis targets watermarks that are *higher* than
> > needed for allocation, even *unrestricted* allocations can now get
> > suckered onto such nodes that are already pressured. This ends up
> > concentrating all allocations on them, even when there are idle nodes
> > available for the unrestricted requests.
> >
> > This was observed with two numa nodes, where node0 is normal and node1
> > is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
> > kswapd on node0 only (since node1 is not eligible); once kswapd0 is
> > active, the watermarks hover between low and high, and then even the
> > movable allocations end up on node0, only to be kicked out again;
> > meanwhile node1 is empty and idle.
>
> Is this because node1 is slow tier as Zi suggested, or we're talking
> about allocations that are from node0's cpu, while allocations on
> node1's cpu would be fine?

It applies in either case. The impetus for this fix came from behavior
observed on a tiered system, but it seems like a general NUMA problem
to me. Say you have a VM where you use an extra node for runtime
resizing, making it ZONE_MOVABLE to keep it hotpluggable.

> > Similar behavior is possible when a process with NUMA bindings is
> > causing selective kswapd wakeups.
> >
> > To fix this, on NUMA systems augment the (misleading) watermark test
> > with a check for whether kswapd is already active during the first
> > iteration through the zonelist. If this fails to place the request,
> > kswapd must be running everywhere already, and the watermark test is
> > good enough to decide placement.
>
> Suppose kswapd finished reclaim already, so this check wouldn't kick in.
> Wouldn't we be over-pressuring node0 still, just somewhat less?

Yes. And we've seen that to a degree, where kswapd goes to sleep
intermittently and the occasional (high - low) batch of fresh pages
makes it into node0 until kswapd is woken up again.

It still fixed the big-picture pathological case, though, where
*everything* was just concentrated on node0. So I figured: why
complicate it? But there would be room for some hysteresis.
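
For the record, the shape of the check is roughly the sketch below.
This is not the literal diff; the node_kswapd_active() helper, using
kswapd_wait as an "is it running" proxy, and the exact hook point in
the zonelist walk are illustrative assumptions only:

        /* Sketch only: assumes kswapd parks on kswapd_wait when idle */
        static bool node_kswapd_active(struct zone *zone)
        {
                pg_data_t *pgdat = zone->zone_pgdat;

                return !waitqueue_active(&pgdat->kswapd_wait);
        }

        /* In the first pass over the zonelist for the request: */
        if (first_pass && node_kswapd_active(zone)) {
                /*
                 * Watermarks hovering between low and high only mean
                 * kswapd is keeping up; don't let that make a
                 * pressured node look as good as a fully idle one.
                 */
                continue;       /* try the next zone in the zonelist */
        }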

Another option could be, instead of checking kswapds, to check the
watermarks against the high thresholds on that first zonelist
iteration. After all, that's where a recently-gone-to-sleep kswapd
would leave the watermark level.

But it would need a fudge factor too, to account for the fact that
kswapd might overreclaim past the high watermark. And the overreclaim
factor is something that has historically fluctuated quite a bit
between systems and kernel versions. So this could be too fragile. By
comparison, kswapd being active is a much more definitive signal.
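
Concretely, that alternative would look something like the sketch
below; the slack value is made up, which is exactly the fragile part:

        /*
         * Sketch: on the first zonelist pass, require headroom above
         * the *high* watermark to account for kswapd overreclaim.
         */
        static bool zone_idle_enough(struct zone *zone, unsigned int order,
                                     int highest_zoneidx,
                                     unsigned int alloc_flags)
        {
                /* arbitrary fudge: half the low..high gap */
                unsigned long slack = (high_wmark_pages(zone) -
                                       low_wmark_pages(zone)) / 2;

                return zone_watermark_ok(zone, order,
                                         high_wmark_pages(zone) + slack,
                                         highest_zoneidx, alloc_flags);
        }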