Message-ID: <aRWswVgIaAqJEvQB@tiehlicka>
Date: Thu, 13 Nov 2025 11:02:41 +0100
From: Michal Hocko <mhocko@...e.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Jiayuan Chen <jiayuan.chen@...ux.dev>, linux-mm@...ck.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	David Hildenbrand <david@...hat.com>,
	Qi Zheng <zhengqi.arch@...edance.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when
 reclaim was boosted

On Fri 07-11-25 17:11:58, Shakeel Butt wrote:
> On Fri, Oct 24, 2025 at 10:27:11AM +0800, Jiayuan Chen wrote:
> > We encountered a scenario where direct memory reclaim was triggered,
> > leading to increased system latency:
> > 
> > 1. The memory.low values set on host pods are actually quite large, some
> >    pods are set to 10GB, others to 20GB, etc.
> > 2. Since most pods have memory protection configured, each time kswapd is
> >    woken up, if a pod's memory usage hasn't exceeded its own memory.low,
> >    its memory won't be reclaimed.
> 
> Can you share the NUMA configuration of your system? How many nodes
> are there?
> 
> > 3. When applications start up, rapidly consume memory, or experience
> >    network traffic bursts, the kernel reaches steal_suitable_fallback(),
> >    which sets watermark_boost and subsequently wakes kswapd.
> > 4. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
> >    triggered by watermark_boost, the maximum priority is 10. Higher
> >    priority values mean less aggressive LRU scanning, which can result in
> >    no pages being reclaimed during a single scan cycle:
> > 
> > if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> >     raise_priority = false;
> 
> Am I understanding this correctly that watermark boost increases the
> chances of this issue, but it can still happen?
> 
> > 
> > 5. This eventually causes pgdat->kswapd_failures to continuously
> >    accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
> >    working. At this point, the system's available memory is still
> >    significantly above the high watermark — it's inappropriate for kswapd
> >    to stop under these conditions.
> > 
> > The final observable issue is that a brief period of rapid memory
> > allocation causes kswapd to stop running, ultimately triggering direct
> > reclaim and making the applications unresponsive.
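
A toy userspace sketch of the accounting described above (illustrative
only, not kernel code; DEF_PRIORITY and MAX_RECLAIM_RETRIES match the
kernel's values, everything else is simplified). Every boosted wakeup
reclaims nothing because all pods sit below memory.low, so
kswapd_failures climbs until kswapd parks:

#include <stdio.h>

#define DEF_PRIORITY		12	/* as in mm/vmscan.c */
#define MAX_RECLAIM_RETRIES	16	/* as in mm/internal.h */

int main(void)
{
	int kswapd_failures = 0;
	int wakeups = 0;

	while (kswapd_failures < MAX_RECLAIM_RETRIES) {
		/*
		 * A boosted run is capped at priority DEF_PRIORITY - 2, so
		 * the LRU scan stays shallow; with every pod under its
		 * (large) memory.low, nothing gets reclaimed.
		 */
		unsigned long nr_reclaimed = 0;

		wakeups++;
		if (!nr_reclaimed)	/* the pre-patch check, without !boosted */
			kswapd_failures++;
	}

	printf("kswapd parks after %d fruitless boosted wakeups;\n", wakeups);
	printf("further allocation bursts fall back to direct reclaim\n");
	return 0;
}

With the patch below applied, the added !boosted test keeps the counter
at zero for these runs, so kswapd keeps getting woken.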
> > 
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@...ux.dev>
> > 
> > ---
> > v1 -> v2: Do not modify memory.low handling
> > https://lore.kernel.org/linux-mm/20251014081850.65379-1-jiayuan.chen@linux.dev/
> > ---
> >  mm/vmscan.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 92f4ca99b73c..fa8663781086 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -7128,7 +7128,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  		goto restart;
> >  	}
> >  
> > -	if (!sc.nr_reclaimed)
> > +	/*
> > +	 * If the reclaim was boosted, we might still be far from the
> > +	 * watermark_high at this point. We need to avoid increasing the
> > +	 * failure count to prevent the kswapd thread from stopping.
> > +	 */
> > +	if (!sc.nr_reclaimed && !boosted)
> >  		atomic_inc(&pgdat->kswapd_failures);
> 
> In general I think not incrementing the failure count for boosted
> kswapd iterations is right. If this issue (high protection causing
> kswapd failures) happens in the non-boosted case, I am not sure what
> the right behavior should be, i.e. allocators doing direct reclaim
> potentially below low protection, or allowing kswapd to reclaim below
> low. For min it is very clear that the direct reclaimer has to
> reclaim, as it may have to trigger an oom-kill. For low protection, I
> am not sure.

Our current documentation gives us some room for interpretation. I am
wondering whether we need to change the existing implementation,
though. If kswapd is not able to make progress then we surely have
direct reclaim happening. So I would only change this if we had
examples of properly/sensibly configured systems where a kswapd low
limit breach could help to reduce stalls (improve performance) while
the end result, in terms of the amount of reclaimed memory, would be
the same/very similar.

This specific report is an example where boosting was not low limit
aware, and I agree that not accounting kswapd_failures for boosted runs
is a reasonable thing to do. I am not yet sure this is a complete fix,
but it is certainly a good direction.
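
A toy paraphrase of the low/min protection decision under discussion
(the struct and helper below are illustrative, not kernel API; the
behavior mirrors reclaim skipping cgroups below their protection unless
a low breach is explicitly requested, which kswapd currently never
does):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for a cgroup's protection state. */
struct cgroup_prot {
	unsigned long usage;
	unsigned long low;	/* memory.low */
	unsigned long min;	/* memory.min */
};

/* Hypothetical helper: should reclaim skip this cgroup entirely? */
static bool skip_for_protection(const struct cgroup_prot *cg,
				bool low_breach_allowed)
{
	if (cg->usage <= cg->min)
		return true;	/* min is never breached by reclaim */
	if (cg->usage <= cg->low)
		return !low_breach_allowed;	/* only a retrying direct
						   reclaimer requests this */
	return false;
}

int main(void)
{
	/* A pod using 8G with memory.low = 10G, as in the report. */
	struct cgroup_prot pod = { 8UL << 30, 10UL << 30, 0 };

	printf("kswapd skips pod: %d\n", skip_for_protection(&pod, false));
	printf("low-breaching direct reclaim skips pod: %d\n",
	       skip_for_protection(&pod, true));
	return 0;
}

Because kswapd never requests the breach, a node where every cgroup
sits under its low can make zero progress by design, which is exactly
the ambiguity being weighed above.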
-- 
Michal Hocko
SUSE Labs
