Message-ID: <aWpukFnKRoeSrcEZ@cmpxchg.org>
Date: Fri, 16 Jan 2026 12:00:00 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Jiayuan Chen <jiayuan.chen@...ux.dev>
Cc: linux-mm@...ck.org, shakeel.butt@...ux.dev,
Jiayuan Chen <jiayuan.chen@...pee.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Brendan Jackman <jackmanb@...gle.com>, Zi Yan <ziy@...dia.com>,
Qi Zheng <zhengqi.arch@...edance.com>, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/2] mm/vmscan: mitigate spurious kswapd_failures
reset from direct reclaim
On Wed, Jan 14, 2026 at 03:40:35PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@...pee.com>
>
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
>
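For context, the unconditional reset being referred to is the one at the
end of shrink_node(). A simplified sketch (field and call follow the hunk
quoted further down; the exact form in any given tree may differ):

	/* shrink_node(): any reclaim progress at all clears the counter */
	if (reclaimable)
		atomic_set(&pgdat->kswapd_failures, 0);
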
> We observed an issue in production on a multi-NUMA system where a
> process allocated large amounts of anonymous pages on a single NUMA
> node, pushing that node's free memory below the high watermark and
> evicting most of its file pages:
>
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
>
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
>
> However, containers on this machine have memory.high set in their
> cgroup. Business processes continuously trigger the high limit, causing
> frequent direct reclaim that keeps resetting kswapd_failures to 0. This
> prevents kswapd from ever stopping.
>
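As a reminder of that path: when a cgroup breaches memory.high, the
allocating task is throttled and reclaim_high() in mm/memcontrol.c drives
targeted reclaim against the offending hierarchy. A condensed sketch
(event accounting, PSI annotations and exact flags elided):

	/* reclaim_high(), simplified: reclaim from every memcg in the
	 * hierarchy that is still above its memory.high. */
	do {
		if (page_counter_read(&memcg->memory) <=
		    READ_ONCE(memcg->memory.high))
			continue;
		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
					     MEMCG_RECLAIM_MAY_SWAP);
	} while ((memcg = parent_mem_cgroup(memcg)) &&
		 !mem_cgroup_is_root(memcg));
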
> The key insight is that direct reclaim triggered by cgroup memory.high
> performs aggressive scanning to throttle the allocating process. With
> sufficiently aggressive scanning, even hot pages will eventually be
> reclaimed, making direct reclaim "successful" at freeing some memory.
> However, this success does not mean the node has reached a balanced
> state - the freed memory may still be insufficient to bring free pages
> above the high watermark. Unconditionally resetting kswapd_failures in
> this case keeps kswapd alive indefinitely.
>
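To make "balanced" concrete: a node counts as balanced when at least one
eligible zone clears its high watermark. A simplified sketch of
pgdat_balanced() (empty-node and promotion-watermark details elided):

	static bool pgdat_balanced(pg_data_t *pgdat, int order,
				   int highest_zoneidx)
	{
		int i;

		for (i = 0; i <= highest_zoneidx; i++) {
			struct zone *zone = pgdat->node_zones + i;

			if (!managed_zone(zone))
				continue;

			/* One zone above its high watermark is enough. */
			if (zone_watermark_ok_safe(zone, order,
						   high_wmark_pages(zone),
						   highest_zoneidx))
				return true;
		}

		return false;
	}
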
> The result is that kswapd runs endlessly. Unlike direct reclaim which
> only reclaims from the allocating cgroup, kswapd scans the entire node's
> memory. This causes hot file pages from all workloads on the node to be
> evicted, not just those from the cgroup triggering memory.high. These
> pages constantly refault, generating sustained heavy IO READ pressure
> across the entire system.
>
> Fix this by only resetting kswapd_failures when the node is actually
> balanced. This allows both kswapd and direct reclaim to clear
> kswapd_failures upon successful reclaim, but only when the reclaim
> actually resolves the memory pressure (i.e., the node becomes balanced).
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@...pee.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@...ux.dev>
Great analysis, and I agree with both the fix and adding tracepoints.
Two minor nits:
> @@ -2650,6 +2650,25 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
> lruvec_memcg(lruvec));
> }
>
> +static void pgdat_reset_kswapd_failures(pg_data_t *pgdat)
> +{
> +	atomic_set(&pgdat->kswapd_failures, 0);
> +}
> +
> +/*
> + * Reset kswapd_failures only when the node is balanced. Without this
> + * check, successful direct reclaim (e.g., from cgroup memory.high
> + * throttling) can keep resetting kswapd_failures even when the node
> + * cannot be balanced, causing kswapd to run endlessly.
> + */
> +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> +static inline void pgdat_try_reset_kswapd_failures(struct pglist_data *pgdat,
Please remove the inline; the compiler will figure it out.
> +						   struct scan_control *sc)
> +{
> +	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
> +		pgdat_reset_kswapd_failures(pgdat);
> +}
As this is kswapd API, please move these down to after wakeup_kswapd().
I think we can streamline the names a bit. We already use "hopeless"
for that state in the comments; can you please rename the functions to
kswapd_clear_hopeless() and kswapd_try_clear_hopeless()?
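Roughly what I have in mind, as an untested sketch (bodies unchanged from
your patch, just the names, and with the inline dropped per the nit above):

	static void kswapd_clear_hopeless(pg_data_t *pgdat)
	{
		atomic_set(&pgdat->kswapd_failures, 0);
	}

	static void kswapd_try_clear_hopeless(pg_data_t *pgdat,
					      struct scan_control *sc)
	{
		if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
			kswapd_clear_hopeless(pgdat);
	}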
We should then also replace the open-coded kswapd_failure checks with
kswapd_test_hopeless(). But I can send a follow-up patch if you don't
want to, just let me know.
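For reference, the test helper I have in mind would be something along
these lines (assuming the open-coded sites keep comparing against
MAX_RECLAIM_RETRIES, as they do today):

	static bool kswapd_test_hopeless(pg_data_t *pgdat)
	{
		return atomic_read(&pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES;
	}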