Message-ID: <u2llnnpmpsgarwrt74ffgo3cuwe4apdbeh5hkclzbh5gykwltb@whb7uuj7ub5i>
Date: Mon, 22 Dec 2025 22:11:05 -0800
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Jiayuan Chen <jiayuan.chen@...ux.dev>
Cc: linux-mm@...ck.org, Jiayuan Chen <jiayuan.chen@...pee.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, 
	David Hildenbrand <david@...nel.org>, Michal Hocko <mhocko@...nel.org>, 
	Qi Zheng <zhengqi.arch@...edance.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset
 from direct reclaim

On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@...ux.dev> wrote:
> 
[...]
> 
> > I don't think kswapd is an issue here. The system is out of memory and
> > most of the memory is unreclaimable. Either change the workload to use
> > less memory or enable swap (or zswap) to have more reclaimable memory.
> 
> 
> Hi,
> Thanks for looking into this.
> 
> Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> 
> This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> 
> Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)

Thanks, and now the situation is much clearer. IIUC you are running
multiple workloads (pods) on the system. How are the memcg limits
configured for these workloads? You mentioned memory.high; what about
memory.max? Also, are you using cpusets to limit the pods to individual
nodes (CPU & memory), or can they run on any node?
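
For reference, the kind of setup I'm asking about usually boils down to
a handful of cgroup v2 interface files. A minimal C sketch, assuming
cgroup v2 is mounted at /sys/fs/cgroup and a hypothetical pod cgroup
named pod0 (the path and the limit values are made up for
illustration):

#include <stdio.h>
#include <stdlib.h>

static void cg_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	/* hypothetical pod cgroup; adjust to your hierarchy */
	snprintf(path, sizeof(path), "/sys/fs/cgroup/pod0/%s", file);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	cg_write("memory.high", "8G");	/* throttle + reclaim above 8 GiB */
	cg_write("memory.max", "10G");	/* hard cap; OOM beyond this */
	cg_write("cpuset.mems", "1");	/* allocate memory from node 1 only */
	return 0;
}

With memory.high but no memory.max, a growing pod keeps getting
throttled and reclaimed against instead of being OOM-killed, which is
one way sustained pressure can build up on a node.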

Overall I still think the NUMA nodes are unbalanced in terms of memory,
and maybe CPU as well. Anyway, let's talk about kswapd.

> 
> Node 0's kswapd runs continuously but cannot reclaim anything
> Direct reclaim succeeds by reclaiming from Node 1
> Direct reclaim resets kswapd_failures,

So a successful reclaim on one node does not reset kswapd_failures on
another node. The kernel reclaims each node one by one, so Node 0's
kswapd_failures is reset only if direct reclaim on Node 0 itself made
progress.
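
For context, the reset lives at the tail of shrink_node(), which works
on a single pgdat, so it can only fire for the node actually being
reclaimed. Roughly (paraphrasing mm/vmscan.c; the exact shape varies by
kernel version):

	/* shrink_node(): any reclaim progress revives this node's kswapd */
	if (reclaimable)
		pgdat->kswapd_failures = 0;

	/* prepare_kswapd_sleep(): give up once the per-node counter saturates */
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return true;	/* allow kswapd to sleep */

Both sites read the counter from the pgdat being reclaimed, so progress
on Node 1 never touches Node 0's counter.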

> preventing Node 0's kswapd from stopping
> The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> 

Have you tried NUMA balancing? I still think it would be better to
schedule upfront in a way that no single node is overcommitted, but
NUMA balancing provides a dynamic way to adjust the load on each node.
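
If it isn't enabled already, automatic NUMA balancing can be flipped on
at runtime via the kernel.numa_balancing sysctl. A minimal sketch,
equivalent to `sysctl kernel.numa_balancing=1`:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

	if (!f) {
		perror("numa_balancing");
		return 1;
	}
	fputs("1", f);	/* 1 = enable automatic NUMA balancing */
	fclose(f);
	return 0;
}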

Can you dig deeper into who is resetting Node 0's kswapd_failures, and
why?
