Message-ID: <20250314205039.GC1316033@cmpxchg.org>
Date: Fri, 14 Mar 2025 16:50:39 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Zi Yan <ziy@...dia.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/5] mm: page_alloc: defrag_mode
On Fri, Mar 14, 2025 at 02:54:03PM -0400, Zi Yan wrote:
> On 13 Mar 2025, at 17:05, Johannes Weiner wrote:
>
> > The page allocator groups requests by migratetype to stave off
> > fragmentation. However, in practice this is routinely defeated by the
> > fact that it gives up *before* invoking reclaim and compaction - which
> > may well produce suitable pages. As a result, fragmentation of
> > physical memory is a common ongoing process in many load scenarios.
> >
> > Fragmentation degrades compaction's ability to produce huge
> > pages. Depending on the lifetime of the fragmenting allocations, those
> > effects can be long-lasting or even permanent, requiring drastic
> > measures like forcible idle states or even reboots as the only
> > reliable ways to recover the address space for THP production.
> >
> > In a kernel build test with supplemental THP pressure, the THP
> > allocation rate steadily declines over 15 runs:
> >
> > thp_fault_alloc
> > 61988
> > 56474
> > 57258
> > 50187
> > 52388
> > 55409
> > 52925
> > 47648
> > 43669
> > 40621
> > 36077
> > 41721
> > 36685
> > 34641
> > 33215
> >
> > This is a hurdle in adopting THP in any environment where hosts are
> > shared between multiple overlapping workloads (cloud environments),
> > and rarely experience true idle periods. To make THP a reliable and
> > predictable optimization, there needs to be a stronger guarantee to
> > avoid such fragmentation.
> >
> > Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
> > its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
> > is enforced on the allocator fastpath and the reclaiming slowpath.
> >
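> > In sketch form (simplified and paraphrased, not the exact hunks),
> > the enforcement amounts to short-circuiting the fastpath's
> > fragmentation heuristics and re-adding the flag in the slowpath:
> >
> > 	/* sysctl: 0 = off, 1 = fully reclaim/compact before fallback */
> > 	static int defrag_mode;
> >
> > 	static inline unsigned int
> > 	alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> > 	{
> > 		unsigned int alloc_flags;
> >
> > 		/* __GFP_KSWAPD_RECLAIM doubles as ALLOC_KSWAPD */
> > 		alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
> >
> > 		if (defrag_mode) {
> > 			alloc_flags |= ALLOC_NOFRAGMENT;
> > 			return alloc_flags;
> > 		}
> >
> > 		/* ... existing ZONE_DMA32 heuristics ... */
> > 		return alloc_flags;
> > 	}
> >
> > 	/* and in __alloc_pages_slowpath(), after recomputing alloc_flags: */
> > 	if (defrag_mode)
> > 		alloc_flags |= ALLOC_NOFRAGMENT;
> >
> > Per the Documentation hunk below, the knob would presumably be
> > exposed as /proc/sys/vm/defrag_mode.
> >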
> > For now, fallbacks are permitted to avert OOMs. There is a plan to add
> > defrag_mode=2 to prefer OOMs over fragmentation, but this requires
> > additional prep work in compaction and the reserve management to make
> > it ready for all possible allocation contexts.
> >
> > The following test results are from a kernel build with periodic
> > bursts of THP allocations, over 15 runs:
> >
> > vanilla defrag_mode=1
> > @claimer[unmovable]: 189 103
> > @claimer[movable]: 92 103
> > @claimer[reclaimable]: 207 61
> > @pollute[unmovable from movable]: 25 0
> > @pollute[unmovable from reclaimable]: 28 0
> > @pollute[movable from unmovable]: 38835 0
> > @pollute[movable from reclaimable]: 147136 0
> > @pollute[reclaimable from unmovable]: 178 0
> > @pollute[reclaimable from movable]: 33 0
> > @steal[unmovable from movable]: 11 0
> > @steal[unmovable from reclaimable]: 5 0
> > @steal[reclaimable from unmovable]: 107 0
> > @steal[reclaimable from movable]: 90 0
> > @steal[movable from reclaimable]: 354 0
> > @steal[movable from unmovable]: 130 0
> >
> > Both types of polluting fallbacks are eliminated in this workload.
> >
> > Interestingly, whole block conversions are reduced as well. This is
> > because once a block is claimed for a type, its empty space remains
> > available for future allocations, instead of being padded with
> > fallbacks; this allows the native type to group up instead of
> > spreading out to new blocks. The assumption in the allocator has been
> > that pollution from movable allocations is less harmful than from
> > other types, since they can be reclaimed or migrated out should the
> > space be needed. However, since fallbacks occur *before*
> > reclaim/compaction is invoked, movable pollution will still cause
> > non-movable allocations to spread out and claim more blocks.
> >
> > Without fragmentation, THP rates hold steady with defrag_mode=1:
> >
> > thp_fault_alloc
> > 32478
> > 20725
> > 45045
> > 32130
> > 14018
> > 21711
> > 40791
> > 29134
> > 34458
> > 45381
> > 28305
> > 17265
> > 22584
> > 28454
> > 30850
> >
> > While the downward trend is eliminated, the keen reader will of course
> > notice that the baseline rate is much smaller than the vanilla
> > kernel's to begin with. This is due to deficiencies in how reclaim and
> > compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
> > to which smaller allocations are competing with THPs for pageblocks,
> > while making no effort themselves to reclaim or compact beyond their
> > own request size. This effect already exists with the current usage of
> > ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting on whole
> > block stealing much more strongly.
> >
> > Subsequent patches will address defrag_mode reclaim strategy to raise
> > the THP success baseline above the vanilla kernel.
>
> This all makes sense to me. But is there a better name than defrag_mode?
> It sounds very similar to /sys/kernel/mm/transparent_hugepage/defrag.
> Or does it actually mean the THP defrag mode?

Thanks for taking a look!

I'm not set on defrag_mode, but I also couldn't think of anything
better.

The proximity to the THP flag name strikes me as beneficial, since
it's an established term for "try harder to make huge pages".

Suggestions welcome :)

> > Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> > ---
> > Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
> > mm/page_alloc.c | 27 +++++++++++++++++++++++--
> > 2 files changed, 34 insertions(+), 2 deletions(-)
> >
>
> While checking ALLOC_NOFRAGMENT, I noticed that in get_page_from_freelist(),
> ALLOC_NOFRAGMENT is removed when the allocation goes to a remote node. I
> wonder if this could reduce the anti-fragmentation effort on NUMA systems.
> Basically, falling back to a remote node for allocation would fragment the
> remote node, even if the remote node is trying hard not to fragment itself.
> Have you tested on a NUMA system?

There is this hunk in the patch:

@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			continue;
 		}
 
-		if (no_fallback && nr_online_nodes > 1 &&
+		if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
 		    zone != zonelist_zone(ac->preferred_zoneref)) {
 			int local_nid;
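
For context, the body of that branch (paraphrased from memory, comments
mine) is what drops the flag once the zonelist walk leaves the local
node; with !defrag_mode added to the condition, defrag_mode skips it
entirely:

	local_nid = zonelist_node_idx(ac->preferred_zoneref);
	if (zone_to_nid(zone) != local_nid) {
		/*
		 * Locality is preferred over fragmentation
		 * avoidance: retry without ALLOC_NOFRAGMENT
		 * when spilling to a remote node.
		 */
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}
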
So it shouldn't clear the flag when spilling into the next node.
Am I missing something?