Message-ID: <langyedbbu7b4zkz5o7yy7m7bdlusoa3zwsjbgrqt2p7ou37qm@fi7rovfl5gfz>
Date: Mon, 12 Jan 2026 13:29:06 -0800
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Jiayuan Chen <jiayuan.chen@...ux.dev>
Cc: Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
Jiayuan Chen <jiayuan.chen@...pee.com>, Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>, David Hildenbrand <david@...nel.org>,
Qi Zheng <zhengqi.arch@...edance.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Axel Rasmussen <axelrasmussen@...gle.com>, Yuanchu Xie <yuanchu@...gle.com>, Wei Xu <weixugc@...gle.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm/vmscan: mitigate spurious kswapd_failures reset
from direct reclaim
Hi Jiayuan,
Sorry for the late reply. Let me respond in-place below.
On Wed, Jan 07, 2026 at 11:39:36AM +0000, Jiayuan Chen wrote:
[...]
>
> Hi Shakeel,
>
> Thanks for the feedback.
>
> To be honest, the issue is difficult to reproduce because the boundary conditions are quite complex.
> We also haven't deployed this patch in production yet. I discovered the relationship between
> kswapd_failures and direct reclaim through the following bpftrace script:
>
> '''bash
> bpftrace -e '
> #include <linux/mmzone.h>
> #include <linux/shrinker.h>
>
> // kswapd path: dump the failure counter on entry to balance_pgdat()
> kprobe:balance_pgdat {
>     $pgdat = (struct pglist_data *)arg0;
>     if ($pgdat->kswapd_failures > 0) {
>         printf("[node %d] [%lu] balance_pgdat entry, kswapd_failures %d\n",
>                $pgdat->node_id, jiffies, $pgdat->kswapd_failures);
>     }
> }
>
> // direct reclaim path: any run that reclaims pages also resets kswapd_failures
> tracepoint:vmscan:mm_vmscan_direct_reclaim_end {
>     printf("[cpu %d] [%lu] direct reclaim end (resets kswapd_failures), nr_reclaimed %lu\n",
>            cpu, jiffies, args.nr_reclaimed);
> }
> '
> '''
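>
> For context, the reset the second probe is watching comes (if I'm reading
> the current mm/vmscan.c correctly) from the check at the end of
> shrink_node(), which direct reclaim reaches through the same path as kswapd:
>
> '''c
> /*
>  * Roughly, at the end of shrink_node() in mm/vmscan.c (paraphrased from my
>  * reading): any reclaim progress, whether made by kswapd or by direct
>  * reclaim, clears the per-node failure counter.
>  */
> if (reclaimable)
> 	pgdat->kswapd_failures = 0;
> '''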
>
> The trace results showed that when kswapd_failures reaches 15, continuous direct reclaim keeps
> resetting it to 0. This was accompanied by a flood of kswapd_failures log entries, and shortly
> after, we observed massive refaults occurring.
> (Note that I can only observe up to 15 in the trace due to a kprobe limitation:
> the kprobe on balance_pgdat fires at function entry, but kswapd_failures is incremented to 16 only
> when balance_pgdat fails to reclaim any pages - at which point kswapd goes to sleep and there's no
> suitable hook point to capture it.)
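>
> If I'm reading mm/vmscan.c correctly, the increment itself sits inside
> balance_pgdat(), roughly like this, which is why an entry kprobe only ever
> sees the value left over from the previous run:
>
> '''c
> /*
>  * Roughly, near the end of balance_pgdat() (paraphrased from my reading):
>  * the counter is bumped only after a whole run reclaims nothing, and once
>  * it reaches MAX_RECLAIM_RETRIES kswapd goes to sleep instead of calling
>  * balance_pgdat() again, so the final increment is never visible at entry.
>  */
> if (!sc.nr_reclaimed)
> 	pgdat->kswapd_failures++;
> '''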
>
>
> Before I send v3, I'd like to continue the discussion to make sure we're aligned on the approach:
>
> Do you think the bpftrace evidence above is sufficient?
Mainly I want to see whether the patch is contributing positively or
negatively in the situation you are seeing in your production environment.
Overall I think Michal and I are on the same page that the patch is a net
positive, but testing in production would eliminate the concerns completely.
Anyway, we can proceed with the patch and we can always change it in the
future if this does not work. Please go ahead with v3 and include the
additional explanation.
>
>
> If you and Michal are okay with the current approach, I'll prepare v3 with the review comments addressed in more detail.
>
> By the way, this tracing limitation makes me wonder: would it be appropriate to add two tracepoints for
> kswapd_failures? One for when kswapd_failures reaches MAX_RECLAIM_RETRIES (16), and another for when it
> gets reset to 0. Currently, the only way to detect this is by polling node_unreclaimable from /proc/zoneinfo,
> but the sampling interval is usually too coarse to catch these events.
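>
> As a strawman (the event name and fields below are placeholders, not an
> existing API), I'm thinking of something along these lines in
> include/trace/events/vmscan.h:
>
> '''c
> /* Hypothetical sketch only -- name, prototype and fields are placeholders. */
> TRACE_EVENT(mm_vmscan_kswapd_failures_reset,
>
> 	TP_PROTO(int nid, int failures),
>
> 	TP_ARGS(nid, failures),
>
> 	TP_STRUCT__entry(
> 		__field(int, nid)
> 		__field(int, failures)
> 	),
>
> 	TP_fast_assign(
> 		__entry->nid = nid;
> 		__entry->failures = failures;
> 	),
>
> 	TP_printk("nid=%d failures=%d", __entry->nid, __entry->failures)
> );
> '''
>
> plus a similar event for the counter hitting MAX_RECLAIM_RETRIES.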
Tracepoints are cheap and I am all for more observability. Go ahead and
propose the tracepoints as you see fit.