Message-ID: <20250313210647.1314586-4-hannes@cmpxchg.org>
Date: Thu, 13 Mar 2025 17:05:34 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>,
Zi Yan <ziy@...dia.com>,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: [PATCH 3/5] mm: page_alloc: defrag_mode

The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which
may well produce suitable pages. As a result, fragmentation of
physical memory is a common ongoing process in many load scenarios.

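To illustrate the grouping (a simplified sketch, not the kernel's
literal tables or identifiers): each migratetype allocates from its own
free lists first, and only when those run dry does it fall back to the
other types in a fixed preference order, rather than trying reclaim or
compaction first:

    /* Simplified sketch only; the real logic lives in mm/page_alloc.c. */
    enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_NR };

    /* Fallback preference when a type's own free lists are empty. */
    static const enum mt fallback_order[MT_NR][MT_NR - 1] = {
            [MT_UNMOVABLE]   = { MT_RECLAIMABLE, MT_MOVABLE },
            [MT_MOVABLE]     = { MT_RECLAIMABLE, MT_UNMOVABLE },
            [MT_RECLAIMABLE] = { MT_UNMOVABLE,   MT_MOVABLE },
    };

defrag_mode does not change this order; it changes how late in the
allocation sequence the fallback is exercised.
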
Fragmentation deteriorates compaction's ability to produce huge
pages. Depending on the lifetime of the fragmenting allocations, those
effects can be long-lasting or even permanent, requiring drastic
measures like forcible idle states or even reboots as the only
reliable ways to recover the address space for THP production.

In a kernel build test with supplemental THP pressure, the THP
allocation rate steadily declines over 15 runs:

thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215

This is a hurdle in adopting THP in any environment where hosts are
shared between multiple overlapping workloads (cloud environments),
and rarely experience true idle periods. To make THP a reliable and
predictable optimization, there needs to be a stronger guarantee to
avoid such fragmentation.

Introduce defrag_mode. When enabled, reclaim/compaction is invoked to
its full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT
is enforced on the allocator fastpath and the reclaiming slowpath.
For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make
it ready for all possible allocation contexts.

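As a usage note (a sketch, not part of the patch): since the entry is
added to the vm sysctl table and documented in sysctl/vm.rst below, the
knob is expected to show up as /proc/sys/vm/defrag_mode, so it can be
enabled with sysctl -w vm.defrag_mode=1 or programmatically:

    /* Userspace sketch: enable defrag_mode at runtime, equivalent to
     * sysctl -w vm.defrag_mode=1; assumes the default procfs mount.
     */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/defrag_mode", "w");

            if (!f || fputs("1\n", f) == EOF || fclose(f) == EOF) {
                    perror("vm.defrag_mode");
                    return 1;
            }
            return 0;
    }

The sysctl table entry below clamps the value to 0 and 1, consistent
with defrag_mode=2 being left for later work.
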
The following test results are from a kernel build with periodic
bursts of THP allocations, over 15 runs:

vanilla defrag_mode=1
@claimer[unmovable]: 189 103
@claimer[movable]: 92 103
@claimer[reclaimable]: 207 61
@pollute[unmovable from movable]: 25 0
@pollute[unmovable from reclaimable]: 28 0
@pollute[movable from unmovable]: 38835 0
@pollute[movable from reclaimable]: 147136 0
@pollute[reclaimable from unmovable]: 178 0
@pollute[reclaimable from movable]: 33 0
@steal[unmovable from movable]: 11 0
@steal[unmovable from reclaimable]: 5 0
@steal[reclaimable from unmovable]: 107 0
@steal[reclaimable from movable]: 90 0
@steal[movable from reclaimable]: 354 0
@steal[movable from unmovable]: 130 0

Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with
fallbacks; this allows the native type to group up instead of
spreading out to new blocks. The assumption in the allocator has been
that pollution from movable allocations is less harmful than from
other types, since they can be reclaimed or migrated out should the
space be needed. However, since fallbacks occur *before*
reclaim/compaction is invoked, movable pollution will still cause
non-movable allocations to spread out and claim more blocks.

Without fragmentation, THP rates hold steady with defrag_mode=1:

thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850

While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla
kernel's to begin with. This is due to deficiencies in how reclaim and
compaction are currently driven: ALLOC_NOFRAGMENT increases the extent
to which smaller allocations compete with THPs for pageblocks, while
themselves making no effort to reclaim or compact beyond their own
request size. This effect already exists with the current usage of
ALLOC_NOFRAGMENT, but is amplified by defrag_mode insisting much more
strongly on whole block stealing.

Subsequent patches will address defrag_mode reclaim strategy to raise
the THP success baseline above the vanilla kernel.

Signed-off-by: Johannes Weiner <hannes@...xchg.org>
---
Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++
mm/page_alloc.c | 27 +++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ec6343ee4248..e169dbf48180 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -29,6 +29,7 @@ files can be found in mm/swap.c.
- compaction_proactiveness
- compaction_proactiveness_leeway
- compact_unevictable_allowed
+- defrag_mode
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -162,6 +163,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it has occurred, can be long-lasting or even permanent.
dirty_background_bytes
======================
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f0404941886..9a02772c2461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
+static int defrag_mode;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -3389,6 +3390,11 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
*/
alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
+ if (defrag_mode) {
+ alloc_flags |= ALLOC_NOFRAGMENT;
+ return alloc_flags;
+ }
+
#ifdef CONFIG_ZONE_DMA32
if (!zone)
return alloc_flags;
@@ -3480,7 +3486,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
continue;
}
- if (no_fallback && nr_online_nodes > 1 &&
+ if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
zone != zonelist_zone(ac->preferred_zoneref)) {
int local_nid;
@@ -3591,7 +3597,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* It's possible on a UMA machine to get through all zones that are
* fragmented. If avoiding fragmentation, reset and try again.
*/
- if (no_fallback) {
+ if (no_fallback && !defrag_mode) {
alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
@@ -4128,6 +4134,9 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
+ if (defrag_mode)
+ alloc_flags |= ALLOC_NOFRAGMENT;
+
return alloc_flags;
}
@@ -4510,6 +4519,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
&compaction_retries))
goto retry;
+ /* Reclaim/compaction failed to prevent the fallback */
+ if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+ alloc_flags &= ~ALLOC_NOFRAGMENT;
+ goto retry;
+ }
/*
* Deal with possible cpuset update races or zonelist updates to avoid
@@ -6286,6 +6300,15 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = SYSCTL_THREE_THOUSAND,
},
+ {
+ .procname = "defrag_mode",
+ .data = &defrag_mode,
+ .maxlen = sizeof(defrag_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
{
.procname = "percpu_pagelist_high_fraction",
.data = &percpu_pagelist_high_fraction,
--
2.48.1