Message-ID: <CAC5umyjOgZE0Qpa3W3qZ=sSkwkuf_md47jctXgi5UKWuG49o1Q@mail.gmail.com>
Date: Thu, 22 Jan 2026 09:32:51 +0900
From: Akinobu Mita <akinobu.mita@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: Michal Hocko <mhocko@...e.com>, linux-cxl@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, akpm@...ux-foundation.org, 
	axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com, 
	hannes@...xchg.org, david@...nel.org, zhengqi.arch@...edance.com, 
	shakeel.butt@...ux.dev, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, 
	vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, ziy@...dia.com, 
	matthew.brost@...el.com, joshua.hahnjy@...il.com, rakie.kim@...com, 
	byungchul@...com, ying.huang@...ux.alibaba.com, apopple@...dia.com, 
	bingjiao@...gle.com, jonathan.cameron@...wei.com, 
	pratyush.brahma@....qualcomm.com
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough
 free memory in the lower memory tier

On Thu, Jan 15, 2026 at 9:40 AM Akinobu Mita <akinobu.mita@...il.com> wrote:
>
> On Thu, Jan 15, 2026 at 2:49 AM Gregory Price <gourry@...rry.net> wrote:
> >
> > On Wed, Jan 14, 2026 at 09:51:28PM +0900, Akinobu Mita wrote:
> > > can_demote() is called from four places.
> > > I tried modifying the patch to change the behavior only when can_demote()
> > > is called from shrink_folio_list(), but the problem was not fixed
> > > (oom did not occur).
> > >
> > > Similarly, changing the behavior of can_demote() when called from
> > > can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> > > but not when called from get_swappiness(), did not fix the problem either
> > > (oom did not occur).
> > >
> > > Conversely, changing the behavior only when called from get_swappiness(),
> > > but not changing the behavior of can_reclaim_anon_pages(),
> > > shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> > > (oom did occur).
> > >
> > > Therefore, it appears that the behavior of get_swappiness() is important
> > > in this issue.
> >
> > "It appears that..." and the process of twiddling bits and observing
> > behavior does not strike confidence in this solution.
> >
> > Can you take another go at trying to define the bad interaction more
> > explicitly? I worry that we're modifying vmscan.c behavior to induce an
> > OOM for a corner case - but it will also cause another regression.
>
> I agree.
> It surprised me that the behavior of get_swappiness() had an impact on the
> issue, so I'll clarify its relationship to this issue.

To investigate what was happening while the system was inoperable, I
applied a debug patch that automatically resets demotion_enabled to
false a certain period after it is set to true.
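
I won't paste the debug patch itself, but from userspace the same
effect can be approximated by toggling the sysfs knob with a timeout.
A sketch (the scratch-file fallback exists only so the script can be
dry-run on a machine without memory tiering, and the 1-second window
here is much shorter than what the debug patch used):

```shell
#!/bin/sh
# Enable demotion, then force it back off after a timeout so the
# machine stays recoverable even if reclaim spins with demotion on.
SYSFS=/sys/kernel/mm/numa/demotion_enabled
if [ ! -w "$SYSFS" ]; then
    SYSFS=$(mktemp)   # fallback for dry-running without memory tiering
fi

echo true  > "$SYSFS"
sleep 1               # the real debug patch used a longer window
echo false > "$SYSFS"
echo "demotion_enabled: $(cat "$SYSFS")"
```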

This made it possible to observe what was going on during that window,
and it showed that the system was not permanently stuck (e.g. in a
deadlock) but was simply wasting time while demotion_enabled was true.

I repeatedly measured the elapsed time of __alloc_pages_slowpath()
invocations that ended up calling out_of_memory(), together with the
number of folios scanned during each invocation (i.e. the total
increase in scan_control.nr_to_scan per execution of shrink_zones()).
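
For reference, this kind of latency measurement can be approximated
with bpftrace (a sketch, not the ad-hoc instrumentation I actually
used; note that __alloc_pages_slowpath() is a static function and may
not be probeable on every kernel build):

```shell
# Histogram of __alloc_pages_slowpath() latency in milliseconds.
# Requires root and a kernel with kprobes; the symbol may carry a
# compiler suffix (e.g. .constprop.0) on some builds.
bpftrace -e '
kprobe:__alloc_pages_slowpath   { @start[tid] = nsecs; }
kretprobe:__alloc_pages_slowpath /@start[tid]/ {
    @ms = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}'
```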

While demotion_enabled was false, the longest __alloc_pages_slowpath()
execution took 185 ms, with 18 calls to try_to_free_pages() and 3095
folios scanned.

With demotion_enabled true, a single __alloc_pages_slowpath()
execution took 144692 ms, with one call to try_to_free_pages() and
5811414 folios scanned.

Note that in this case, as mentioned above, demotion_enabled was
automatically reset to false partway through by the debug patch, which
capped the number of folios scanned and let the allocation finish
sooner; without it, the execution would have taken even longer and
scanned even more folios.

Almost all of that time is spent in folio_alloc_swap(), and a flame
graph shows spinlock contention in the call path
__mem_cgroup_try_charge_swap -> __memcg_memory_event ->
cgroup_file_notify.
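
The flame graph came from the usual perf + FlameGraph workflow; a
sketch (the FlameGraph checkout path and the 30-second sampling window
are assumptions, not necessarily what I ran):

```shell
# Sample all CPUs with call stacks while the workload is wedged,
# then fold the stacks into an interactive SVG flame graph using
# Brendan Gregg's FlameGraph scripts.
perf record -a -g -- sleep 30
perf script \
    | ./FlameGraph/stackcollapse-perf.pl \
    | ./FlameGraph/flamegraph.pl > demotion.svg
```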

In this reproduction procedure no swap is configured, so calls to
folio_alloc_swap() always fail. To avoid the spinlock contention, I
tried modifying the code to return -ENOMEM without calling
folio_alloc_swap(), but the contention simply moved to other locks
(e.g. lruvec->lru_lock in evict_folios()), so that did not work around
the problem.

When demotion_enabled is true, anonymous pages become eviction
candidates even without a swap device, because when the allocating
node has no free memory, demotion may still be able to move anonymous
pages to a lower node and free up memory. However, once free memory on
the demotion target node is also exhausted, every process hunting for
free memory performs the same futile work, and the time is wasted on
lock contention.
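
Whether the demotion target is out of free memory can be checked per
node from sysfs (node numbering and which node is the lower tier are
system-specific):

```shell
# Print free memory for every NUMA node; on a tiered-memory system
# one of these is the demotion target.
for f in /sys/devices/system/node/node*/meminfo; do
    grep -H 'MemFree' "$f"
done
```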

Reducing the lock contention or changing the eviction logic would also
be an interesting direction, but so far I have not come up with any
workaround other than disabling demotion when free memory on the
lower-tier nodes is exhausted.
