Message-ID: <aN1ONiuqoZQTliHG@gourry-fedora-PF4VCD3F>
Date: Wed, 1 Oct 2025 11:52:22 -0400
From: Gregory Price <gourry@...rry.net>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Johannes Weiner <hannes@...xchg.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>, Zi Yan <ziy@...dia.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Joshua Hahn <joshua.hahnjy@...il.com>
Subject: Re: [PATCH] mm: page_alloc: avoid kswapd thrashing due to NUMA
restrictions

On Wed, Oct 01, 2025 at 04:59:02PM +0200, Vlastimil Babka wrote:
> On 9/19/25 6:21 PM, Johannes Weiner wrote:
> >
> > Since kswapd hysteresis targets watermarks that are *higher* than
> > needed for allocation, even *unrestricted* allocations can now get
> > suckered onto such nodes that are already pressured. This ends up
> > concentrating all allocations on them, even when there are idle nodes
> > available for the unrestricted requests.
> >
> > This was observed with two numa nodes, where node0 is normal and node1
> > is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
> > kswapd on node0 only (since node1 is not eligible); once kswapd0 is
> > active, the watermarks hover between low and high, and then even the
> > movable allocations end up on node0, only to be kicked out again;
> > meanwhile node1 is empty and idle.
>
> Is this because node1 is slow tier as Zi suggested, or we're talking
> about allocations that are from node0's cpu, while allocations on
> node1's cpu would be fine?
>
> Also this sounds like something that ZONELIST_ORDER_ZONE handled until
> it was removed. But it wouldn't help with the NUMA binding case.
>
node1 is a cpu-less memory node with 100% ZONE_MOVABLE memory. Our first
theory was that this was a zone-order vs. node-order issue, but we found
the kswapd thrashing described above to be the actual problem.

No mempolicy was in use here; it's all grounded in GFP/ZONE interactions.
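
To make that concrete, here is a rough userspace model of how the gfp
flags restrict which zones (and therefore which of these two nodes) are
eligible. The node layout and helpers below are made up for
illustration; this is not kernel code:

/*
 * Toy model: node0 has ZONE_NORMAL only, node1 has ZONE_MOVABLE only.
 * A request's highest allowed zone comes from its gfp flags, roughly
 * the way gfp_zone() derives it, so a kernel request can only land on
 * node0 (and only kswapd0 gets woken), while a movable request may use
 * either node.
 */
#include <stdio.h>

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE };

#define TOY_GFP_MOVABLE 0x1	/* stand-in for __GFP_MOVABLE */

struct node { const char *name; enum zone_type zone; };

static enum zone_type highest_zone(unsigned int gfp)
{
	return (gfp & TOY_GFP_MOVABLE) ? ZONE_MOVABLE : ZONE_NORMAL;
}

int main(void)
{
	struct node nodes[2] = {
		{ "node0", ZONE_NORMAL },	/* normal memory */
		{ "node1", ZONE_MOVABLE },	/* cpu-less, 100% ZONE_MOVABLE */
	};
	unsigned int reqs[2] = { 0, TOY_GFP_MOVABLE };

	for (int r = 0; r < 2; r++) {
		printf("%s request:", reqs[r] ? "movable" : "kernel");
		for (int n = 0; n < 2; n++)
			if (nodes[n].zone <= highest_zone(reqs[r]))
				printf(" %s", nodes[n].name);
		printf("\n");
	}
	return 0;
}

The kernel request prints only node0; the movable request prints both,
which is why a stream of kernel allocations only ever wakes kswapd0.
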
> > Similar behavior is possible when a process with NUMA bindings is
> > causing selective kswapd wakeups.
> >
> > To fix this, on NUMA systems augment the (misleading) watermark test
> > with a check for whether kswapd is already active during the first
> > iteration through the zonelist. If this fails to place the request,
> > kswapd must be running everywhere already, and the watermark test is
> > good enough to decide placement.
>
> Suppose kswapd finished reclaim already, so this check wouldn't kick in.
> Wouldn't we be over-pressuring node0 still, just somewhat less?
>
This is the current and desired behavior when the nodes don't have
exclusive zones. We still want allocations to kick kswapd so it can
reclaim/age/demote cold folios from the local node to the remote node.

But while that is happening, if the remote node is not pressured, there's
no reason to wait for reclaim before servicing an allocation there.

Once all nodes are pressured (every kswapd is running), we end up back
in the current position of preferring to wait for a page on the local
node rather than fall back to the remote node.

There will obviously be some transient sleeping/waking of kswapd, but
that's already the case today. The key observation is that this patch
allows fallback allocations to land on remote nodes when the nodes have
exclusive zone memberships (node0=NORMAL, node1=MOVABLE).
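
Roughly, the placement decision being described would look like the
sketch below (a userspace toy with made-up names like kswapd_active[]
and pick_node(), not the actual patch):

/*
 * Pass 1: prefer eligible nodes that pass the watermark check AND whose
 * kswapd is idle, so unrestricted allocations can fall back to an idle
 * remote node. If kswapd is running on every eligible node, pass 2
 * falls back to the plain watermark test.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 2

static bool node_eligible[NR_NODES] = { true, true  };	/* e.g. a movable request */
static bool watermark_ok[NR_NODES]  = { true, true  };	/* both "look" fine */
static bool kswapd_active[NR_NODES] = { true, false };	/* kswapd0 is churning */

static int pick_node(void)
{
	/* Pass 1: skip nodes whose kswapd is already running. */
	for (int nid = 0; nid < NR_NODES; nid++)
		if (node_eligible[nid] && watermark_ok[nid] && !kswapd_active[nid])
			return nid;

	/* Pass 2: kswapd runs everywhere eligible; watermarks alone decide. */
	for (int nid = 0; nid < NR_NODES; nid++)
		if (node_eligible[nid] && watermark_ok[nid])
			return nid;

	return -1;	/* everything below the watermark: enter the slowpath */
}

int main(void)
{
	printf("movable allocation placed on node%d\n", pick_node());
	return 0;
}

With kswapd0 running and node1 idle this picks node1; once kswapd1 is
awake as well, pass 2 takes over and we're back to today's
watermark-based preference for the local node.
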
~Gregory