Message-ID: <20250323034657.GD1894930@cmpxchg.org>
Date: Sat, 22 Mar 2025 23:46:57 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Brendan Jackman <jackmanb@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	Mel Gorman <mgorman@...hsingularity.net>, Zi Yan <ziy@...dia.com>,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/5] mm: page_alloc: defrag_mode

On Sat, Mar 22, 2025 at 09:34:09PM -0400, Johannes Weiner wrote:
> On Sat, Mar 22, 2025 at 08:58:27PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 22, 2025 at 04:05:52PM +0100, Brendan Jackman wrote:
> > > On Thu Mar 13, 2025 at 10:05 PM CET, Johannes Weiner wrote:
> > > > +	/* Reclaim/compaction failed to prevent the fallback */
> > > > +	if (defrag_mode) {
> > > > +		alloc_flags &= ALLOC_NOFRAGMENT;
> > > > +		goto retry;
> > > > +	}
> > > 
> > > I can't see where ALLOC_NOFRAGMENT gets cleared, is it supposed to be
> > > here (i.e. should this be ~ALLOC_NOFRAGMENT)?
> 
> Please ignore my previous email; this is actually a much more severe
> issue than I thought at first. The screwed-up clearing is bad, but
> the code also doesn't check the flag before retrying, which means
> the thread will retry reclaim/compaction and never reach OOM.
> 
> This code has seen weeks of load testing, with workloads fine-tuned
> to *avoid* OOM. A blatant OOM test shows the problem immediately.
> 
> A simple fix, but I'll put it through the wringer before sending it.
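
To spell out the mask arithmetic for the archives (a minimal
userspace sketch; the flag values here are made up for illustration,
they are not the kernel's):

#include <stdio.h>

#define ALLOC_NOFRAGMENT 0x4
#define ALLOC_OTHER      0x8	/* stand-in for the remaining alloc_flags bits */

int main(void)
{
	unsigned int alloc_flags = ALLOC_NOFRAGMENT | ALLOC_OTHER;

	/* buggy: '&' keeps ALLOC_NOFRAGMENT set and wipes every other bit */
	unsigned int buggy = alloc_flags & ALLOC_NOFRAGMENT;

	/* intended: '& ~' clears only ALLOC_NOFRAGMENT */
	unsigned int fixed = alloc_flags & ~ALLOC_NOFRAGMENT;

	printf("buggy: %#x, fixed: %#x\n", buggy, fixed);	/* 0x4, 0x8 */
	return 0;
}

Because the buggy version leaves the flag set (and the branch never
tests it), the retry condition holds forever; the fix both tests the
bit and clears it, so the branch can fire at most once.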

Ok, here is the patch. I verified it by intentionally OOMing the
machine 100 times in a loop; with defrag_mode this would previously
lock up on the first try, but with the fix applied it OOM-kills and
recovers reliably.
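
For reference, the OOM side of the test is just an allocate-and-touch
loop along these lines (a sketch of the shape of the repro, not the
exact harness I used):

#include <stdlib.h>
#include <string.h>

#define CHUNK (64UL << 20)	/* 64 MiB per step */

int main(void)
{
	for (;;) {
		char *p = malloc(CHUNK);
		if (!p)
			break;
		/* touch the pages so they are actually faulted in */
		memset(p, 0xaa, CHUNK);
	}
	return 0;
}

Run repeatedly with defrag_mode enabled until the OOM killer fires
(or, before this fix, until the machine locks up).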

I also re-ran the full THP benchmarks, to verify that the erroneous
looping here did not accidentally contribute to fragmentation
avoidance and thus to the THP success & latency rates. It did not;
the improvements claimed for defrag_mode are unchanged with this fix:

                                                VANILLA    defrag_mode=1-OOMFIX
Hugealloc Time mean               52739.45 (    +0.00%)   27342.44 (   -48.15%)
Hugealloc Time stddev             56541.26 (    +0.00%)   33227.16 (   -41.23%)
Kbuild Real time                    197.47 (    +0.00%)     196.32 (    -0.58%)
Kbuild User time                   1240.49 (    +0.00%)    1231.89 (    -0.69%)
Kbuild System time                   70.08 (    +0.00%)      58.75 (   -15.95%)
THP fault alloc                   46727.07 (    +0.00%)   62669.93 (   +34.12%)
THP fault fallback                21910.60 (    +0.00%)    5966.40 (   -72.77%)
Direct compact fail                 195.80 (    +0.00%)      50.53 (   -73.81%)
Direct compact success                7.93 (    +0.00%)       4.07 (   -43.28%)
Compact daemon scanned migrate  3369601.27 (    +0.00%) 1588238.93 (   -52.87%)
Compact daemon scanned free     5075474.47 (    +0.00%) 1441944.27 (   -71.59%)
Compact direct scanned migrate   161787.27 (    +0.00%)   64838.53 (   -59.92%)
Compact direct scanned free      163467.53 (    +0.00%)   37243.00 (   -77.22%)
Compact total migrate scanned   3531388.53 (    +0.00%) 1653077.47 (   -53.19%)
Compact total free scanned      5238942.00 (    +0.00%) 1479187.27 (   -71.77%)
Alloc stall                        2371.07 (    +0.00%)     553.00 (   -76.64%)
Pages kswapd scanned            2160926.73 (    +0.00%) 4052539.93 (   +87.54%)
Pages kswapd reclaimed           533191.07 (    +0.00%)  765447.47 (   +43.56%)
Pages direct scanned             400450.33 (    +0.00%)  358933.93 (   -10.37%)
Pages direct reclaimed            94441.73 (    +0.00%)   26991.60 (   -71.42%)
Pages total scanned             2561377.07 (    +0.00%) 4411473.87 (   +72.23%)
Pages total reclaimed            627632.80 (    +0.00%)  792439.07 (   +26.26%)
Swap out                          47959.53 (    +0.00%)  128511.80 (  +167.96%)
Swap in                            7276.00 (    +0.00%)   27736.20 (  +281.16%)
File refaults                    138043.00 (    +0.00%)  206198.40 (   +49.37%)

Many thanks for your careful review, Brendan.

---

From c84651a46910448c6cfaf44885644fdb215f7f6a Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@...xchg.org>
Date: Sat, 22 Mar 2025 19:21:45 -0400
Subject: [PATCH] mm: page_alloc: fix defrag_mode's retry & OOM path

Brendan points out that defrag_mode doesn't properly clear
ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking
closer, the problem is more severe: the code never *checks* whether
it has already retried, and keeps looping. This means the OOM path is
never taken, and the thread can loop indefinitely.

This is verified with an intentional OOM test with defrag_mode=1,
which previously resulted in the machine hanging. After this patch,
it triggers the OOM kill reliably and recovers.

Clear ALLOC_NOFRAGMENT properly, and only retry once.

Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
Reported-by: Brendan Jackman <jackmanb@...gle.com>
Signed-off-by: Johannes Weiner <hannes@...xchg.org>
---
 mm/page_alloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c01998cb3a0..582364d42906 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4543,8 +4543,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 
 	/* Reclaim/compaction failed to prevent the fallback */
-	if (defrag_mode) {
-		alloc_flags &= ALLOC_NOFRAGMENT;
+	if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
 		goto retry;
 	}
 
-- 
2.49.0

