Message-ID: <aXksUiwYGwad5JvC@gourry-fedora-PF4VCD3F>
Date: Tue, 27 Jan 2026 16:21:22 -0500
From: Gregory Price <gourry@...rry.net>
To: Akinobu Mita <akinobu.mita@...il.com>
Cc: Michal Hocko <mhocko@...e.com>, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
akpm@...ux-foundation.org, axelrasmussen@...gle.com,
yuanchu@...gle.com, weixugc@...gle.com, hannes@...xchg.org,
david@...nel.org, zhengqi.arch@...edance.com,
shakeel.butt@...ux.dev, lorenzo.stoakes@...cle.com,
Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
surenb@...gle.com, ziy@...dia.com, matthew.brost@...el.com,
joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
ying.huang@...ux.alibaba.com, apopple@...dia.com,
bingjiao@...gle.com, jonathan.cameron@...wei.com,
pratyush.brahma@....qualcomm.com
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough
free memory in the lower memory tier
On Mon, Jan 26, 2026 at 10:57:11AM +0900, Akinobu Mita wrote:
> >
> > Doesn't this suggest what I mentioned earlier? If you don't demote when
> > the target node is full, then you're removing a memory pressure signal
> > from the lower node and reclaim won't ever clean up the lower node to
> > make room for future demotions.
>
> Thank you for your analysis.
> Now I finally understand the concerns (though I'll need to learn more
> to find a solution...)
>
Apologies for the multiple threads - I accidentally replied on v3.
It's taken me a while to untangle this, but what looks like it might
be happening is that demote_folios is actually stealing all the
potential swap candidates, leaving reclaim with no forward progress
and no OOM signal.
1) demotion is already not a reclaim signal, so forgive my prior
   comments - I missed the masking of ~__GFP_RECLAIM
2) it appears we spend most of the time building the demotion list, but
   then just abandon the list without having made progress when the
   demotion target allocation fails (with __GFP_THISNODE you don't get
   an OOM on allocation failure, we just continue)
3) I don't see hugetlb pages triggering the GFP_RECLAIM override bug
   being an issue in reclaim, because page->lru is used for something
   else in hugetlb pages (i.e. we shouldn't see hugetlb pages here)
4) skipping the entire demotion pass will shunt all this pressure to
   swap instead (do_demote_pass = false -> so we swap instead).
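To make point (2) concrete, here is a rough sketch (Python pseudocode,
not kernel code - the function and field names are illustrative only) of
the control flow I'm describing: reclaim spends the pass collecting
demotion candidates, but when the node-bound allocation fails, the list
is simply dropped - no progress is counted and no OOM is raised:

```python
def shrink_folio_list(folios, target_node_has_room, do_demote_pass=True):
    """Toy model of the behavior described above (illustrative only)."""
    demote_candidates = []
    reclaimed = 0
    for folio in folios:
        if do_demote_pass and folio["tier"] == "top":
            # Most of the pass goes into collecting demotion candidates.
            demote_candidates.append(folio)
        else:
            reclaimed += 1  # swapped or dropped

    if demote_candidates:
        if target_node_has_room:
            reclaimed += len(demote_candidates)  # demotion succeeds
        else:
            # Node-bound allocation fails: no OOM is raised, the list
            # is abandoned, and these folios make no forward progress.
            pass
    return reclaimed

# With a full lower tier, the whole pass makes zero progress:
folios = [{"tier": "top"} for _ in range(8)]
assert shrink_folio_list(folios, target_node_has_room=False) == 0

# Skipping the demotion pass shunts the same pressure to swap instead:
assert shrink_folio_list(folios, target_node_has_room=False,
                         do_demote_pass=False) == 8
```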
The risk here is that the OOM situation is temporary and some amount of
memory from the top tier gets shunted to swap while kswapd on other
tiers makes progress. This is effectively LRU inversion.
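The inversion can be shown with a toy model (again illustrative only,
using page "age" as a stand-in for coldness, higher = colder): when the
demotion target is full and we swap from the top tier instead, the page
we push behind a swap-in is warmer than pages still resident below it:

```python
# Toy model of the inversion: higher "age" means colder.
top_tier = [10, 20, 30]    # top-tier page ages (30 = coldest up here)
lower_tier = [40, 50, 60]  # lower tier is full of strictly colder pages

# Demotion target is full, so reclaim swaps the top tier's coldest page.
swapped = max(top_tier)

# Inversion: the swapped-out page is warmer than every page still
# resident in the (slower but still memory-backed) lower tier.
assert all(swapped < age for age in lower_tier)
```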
Why swappiness affects behavior is likely that it changes how
aggressively your lower tier gets reclaimed, and therefore reduces
upper-tier demotion failures until swap is already pressured.
I'm not sure there's a best option here; we may need additional input to
determine what the least-worst option is. Causing LRU inversion when
all the nodes are pressured but swap is available is not preferable.
~Gregory