linux-kernel - Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier

Message-ID: <20260122183453.2619156-1-joshua.hahnjy@gmail.com>
Date: Thu, 22 Jan 2026 13:34:53 -0500
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Akinobu Mita <akinobu.mita@...il.com>
Cc: Michal Hocko <mhocko@...e.com>,
	linux-cxl@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	akpm@...ux-foundation.org,
	axelrasmussen@...gle.com,
	yuanchu@...gle.com,
	weixugc@...gle.com,
	hannes@...xchg.org,
	david@...nel.org,
	zhengqi.arch@...edance.com,
	shakeel.butt@...ux.dev,
	lorenzo.stoakes@...cle.com,
	Liam.Howlett@...cle.com,
	vbabka@...e.cz,
	rppt@...nel.org,
	surenb@...gle.com,
	ziy@...dia.com,
	matthew.brost@...el.com,
	joshua.hahnjy@...il.com,
	rakie.kim@...com,
	byungchul@...com,
	gourry@...rry.net,
	ying.huang@...ux.alibaba.com,
	apopple@...dia.com,
	bingjiao@...gle.com,
	jonathan.cameron@...wei.com,
	pratyush.brahma@....qualcomm.com
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier

Hello Akinobu,

I hope you are doing well! First of all, sorry for the late review on the
series. I have a few questions about the problem itself, and how it is being
triggered.

> > > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > > the OOM killer is not invoked properly.
> > >
> > > Here's the command to reproduce:
> > >
> > > $ sudo swapoff -a
> > > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> > >     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> > >
> > > The memory usage is the number of workers specified with the --memrate
> > > option multiplied by the buffer size specified with the --memrate-bytes
> > > option, so please adjust it so that it exceeds the total size of the
> > > installed DRAM and CXL memory.
> > >
> > > If swap is disabled, you can usually expect the OOM killer to terminate
> > > the stress-ng process when memory usage approaches the installed memory
> > > size.
> > >
> > > However, if multiple memory-tiers exist (multiple
> > > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> > > /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> > > invoked and the system will become inoperable, regardless of whether MGLRU
> > > is enabled or not.
> > >
> > > This issue can be reproduced using NUMA emulation even on systems with
> > > only DRAM.  You can create two-fake memory-tiers by booting a single-node
> > > system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> > > parameters.

[...snip...]

> can_demote() is called from four places.
> I tried modifying the patch to change the behavior only when can_demote()
> is called from shrink_folio_list(), but the problem was not fixed
> (oom did not occur).
> 
> Similarly, changing the behavior of can_demote() when called from
> can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> but not when called from get_swappiness(), did not fix the problem either
> (oom did not occur).
> 
> Conversely, changing the behavior only when called from get_swappiness(),
> but not changing the behavior of can_reclaim_anon_pages(),
> shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> (oom did occur).
> 
> Therefore, it appears that the behavior of get_swappiness() is important
> in this issue.

This is quite mysterious.

Especially because get_swappiness() is an MGLRU-exclusive function, I find
it quite strange that the issue you mention above occurs regardless of whether
MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
as before? Were these hangs similarly fixed by modifying the callsite in
get_swappiness()?

On a separate note, I feel a bit uncomfortable making this the default
behavior, regardless of whether there is swap space or not. Just as it is
easy to create a degenerate scenario where all memory is unreclaimable
and the system starts going into (wasteful) reclaim on the lower tiers,
it is equally easy to create a scenario where all memory is very easily
reclaimable (say, clean pagecache) and we OOM without making any attempt to
free up memory on the lower tiers.

Reality is likely somewhere in between. And from my perspective, as long as
we have some amount of easily reclaimable memory, I don't think immediately
OOMing will be helpful for the system (and even if none of the memory is
easily reclaimable, we should still try doing something before killing).
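
To make that middle ground concrete, here is a rough userspace sketch in C.
It is not kernel code and not a proposed patch: the enum, the struct, and
pick_action() are names made up for illustration, and the checks are
placeholders for whatever heuristics reclaim would really use. The idea is
only that OOM is the last resort, tried after demotion and after reclaim on
the lower tier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum reclaim_action {
	ACTION_DEMOTE,	/* lower tier has room: demote cold pages into it */
	ACTION_RECLAIM,	/* lower tier is full but something can be freed */
	ACTION_OOM,	/* nothing left to try */
};

struct tier_state {
	unsigned long free_pages;	 /* free pages in the lower tier */
	unsigned long reclaimable_pages; /* e.g. clean pagecache in the lower tier */
	unsigned long anon_pages;	 /* anonymous pages in the lower tier */
	bool has_swap;			 /* swap (or zram/zswap) is available */
};

/* Decide what to do when the upper tier needs nr_needed pages freed. */
enum reclaim_action pick_action(const struct tier_state *lower,
				unsigned long nr_needed)
{
	assert(lower != NULL);

	/* Enough free memory below: demotion can make progress. */
	if (lower->free_pages >= nr_needed)
		return ACTION_DEMOTE;

	/* Easily reclaimable memory below: free it before giving up. */
	if (lower->reclaimable_pages > 0)
		return ACTION_RECLAIM;

	/* Anonymous memory can only be reclaimed if it can be swapped out. */
	if (lower->has_swap && lower->anon_pages > 0)
		return ACTION_RECLAIM;

	return ACTION_OOM;
}
```

In this model, the swapoff reproducer with an exhausted lower tier and only
anonymous memory ends in ACTION_OOM, while a lower tier full of clean
pagecache gets reclaimed first instead of triggering an immediate kill.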

> > > The reason for this issue is that memory allocations do not directly
> > > trigger the oom-killer, assuming that if the target node has an underlying
> > > memory tier, it can always be reclaimed by demotion.

This patch enforces the opposite assumption: that even if a target node has
an underlying memory tier, it can never be reclaimed by demotion.

Certainly for systems with swap and some compression methods (z{ram, swap}),
this new enforcement could be harmful to the system. What do you think?

Again, sorry for the late review. I hope you have a great day!
Joshua