Message-ID: <20251001162347.GA1597553@cmpxchg.org>
Date: Wed, 1 Oct 2025 12:23:47 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Zi Yan <ziy@...dia.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Michal Hocko <mhocko@...e.com>,
	Brendan Jackman <jackmanb@...gle.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Gregory Price <gourry@...rry.net>,
	Joshua Hahn <joshua.hahnjy@...il.com>
Subject: Re: [PATCH] mm: page_alloc: avoid kswapd thrashing due to NUMA
 restrictions

Sorry I missed your reply :(

On Fri, Sep 19, 2025 at 01:18:28PM -0400, Zi Yan wrote:
> On 19 Sep 2025, at 12:21, Johannes Weiner wrote:
> 
> > On NUMA systems without bindings, allocations check all nodes for free
> > space, then wake up the kswapds on all nodes and retry. This ensures
> > all available space is evenly used before reclaim begins. However,
> > when one process or certain allocations have node restrictions, they
> > can cause kswapds on only a subset of nodes to be woken up.
> >
> > Since kswapd hysteresis targets watermarks that are *higher* than
> > needed for allocation, even *unrestricted* allocations can now get
> > suckered onto such nodes that are already pressured. This ends up
> > concentrating all allocations on them, even when there are idle nodes
> > available for the unrestricted requests.
> 
> This is because we build the zonelist from node 0 to the last node,
> and allocations always take free pages in zonelist order, right?

Yes, exactly.

> > This was observed with two numa nodes, where node0 is normal and node1
> > is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
> > kswapd on node0 only (since node1 is not eligible); once kswapd0 is
> > active, the watermarks hover between low and high, and then even the
> > movable allocations end up on node0, only to be kicked out again;
> > meanwhile node1 is empty and idle.
> >
> > Similar behavior is possible when a process with NUMA bindings is
> > causing selective kswapd wakeups.
> >
> > To fix this, on NUMA systems augment the (misleading) watermark test
> > with a check for whether kswapd is already active during the first
> > iteration through the zonelist. If this fails to place the request,
> > kswapd must be running everywhere already, and the watermark test is
> > good enough to decide placement.
> >
> > With this patch, unrestricted requests successfully make use of node1,
> > even while kswapd is reclaiming node0 for restricted allocations.
> 
> Thinking about this from a memory tiering POV: when a fast node (e.g., node 0,
> with node 1 being a slow node) is evicting cold pages using kswapd,
> unrestricted programs will see performance degradation after your change,
> since before the change they started on the fast node, but now they start
> on the slow node.

I don't think that's quite right. The default local-first NUMA policy
absent any bindings or zone restrictions is that you first fill node0,
*then* you fill node1, *then* kswapd is woken up on both nodes - at
which point new allocations would go wherever there is room in order
of preference.

I'm just making it so that iff kswapd0 is woken prematurely due to
restrictions, we still fill node1.

In either case, node1 is only filled when node0 space is exhausted.

> Maybe the kernel wants to shuffle the zonelist based on the emptiness of each
> zone, trying to spread allocations across all zones. For memory tiering,
> spreading allocations should be done within a tier. Even with this fix,
> in a case with 3 nodes where node 0 is heavily used by restricted
> allocations, node 2 will stay unused for unrestricted allocations until
> node 1 is full, and an unnecessary kswapd wakeup on node 1 can happen.

Kswapd on node1 only wakes once node2 is watermark-full as well. This
is the intended behavior of the "local first" NUMA policy. I'm not
trying to implement interleaving; this is purely about the quirk that
watermarks alone are not reliable predictors of whether a node is
full when kswapd is running on it.

So we would expect to see

fill node0 -> fill node1 -> fill node2 -> wake all sleeping kswapds

- without restricted allocations in the vanilla kernel
- with restricted allocations after this patch.
