Message-ID: <20250313210647.1314586-1-hannes@cmpxchg.org>
Date: Thu, 13 Mar 2025 17:05:31 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>,
Mel Gorman <mgorman@...hsingularity.net>,
Zi Yan <ziy@...dia.com>,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: [PATCH 0/5] mm: reliable huge page allocator

This series changes the page allocator and the reclaim/compaction code
to try harder to avoid fragmentation. As a result, huge page
allocations become cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

                                   before                 after
Hugealloc Time mean              52739.45 ( +0.00%)    28904.00 ( -45.19%)
Hugealloc Time stddev            56541.26 ( +0.00%)    33464.37 ( -40.81%)
Kbuild Real time                   197.47 ( +0.00%)      196.59 (  -0.44%)
Kbuild User time                  1240.49 ( +0.00%)     1231.67 (  -0.71%)
Kbuild System time                  70.08 ( +0.00%)       59.10 ( -15.45%)
THP fault alloc                  46727.07 ( +0.00%)    63223.67 ( +35.30%)
THP fault fallback               21910.60 ( +0.00%)     5412.47 ( -75.29%)
Direct compact fail                195.80 ( +0.00%)       59.07 ( -69.48%)
Direct compact success               7.93 ( +0.00%)        2.80 ( -57.46%)
Direct compact success rate %        3.51 ( +0.00%)        3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%)  2267500.33 ( -32.71%)
Compact daemon scanned free    5075474.47 ( +0.00%)  2339773.00 ( -53.90%)
Compact direct scanned migrate  161787.27 ( +0.00%)    47659.93 ( -70.54%)
Compact direct scanned free     163467.53 ( +0.00%)    40729.67 ( -75.08%)
Compact total migrate scanned  3531388.53 ( +0.00%)  2315160.27 ( -34.44%)
Compact total free scanned     5238942.00 ( +0.00%)  2380502.67 ( -54.56%)
Alloc stall                       2371.07 ( +0.00%)      638.87 ( -73.02%)
Pages kswapd scanned           2160926.73 ( +0.00%)  4002186.33 ( +85.21%)
Pages kswapd reclaimed          533191.07 ( +0.00%)   718577.80 ( +34.77%)
Pages direct scanned            400450.33 ( +0.00%)   355172.73 ( -11.31%)
Pages direct reclaimed           94441.73 ( +0.00%)    31162.80 ( -67.00%)
Pages total scanned            2561377.07 ( +0.00%)  4357359.07 ( +70.12%)
Pages total reclaimed           627632.80 ( +0.00%)   749740.60 ( +19.46%)
Swap out                         47959.53 ( +0.00%)   110084.33 (+129.53%)
Swap in                           7276.00 ( +0.00%)    24457.00 (+236.10%)
File refaults                   138043.00 ( +0.00%)   188226.93 ( +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%.
These metrics also hold up over time, while the vanilla kernel sees a
steady downward trend in success rates with each subsequent run,
owing to the cumulative effects of fragmentation.

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT allocation flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
whole pageblocks, which boosts the success rate in the
ALLOC_NOFRAGMENT hotpaths.

Main differences from the RFC:

- The freelist hygiene patches have since been upstreamed separately.

- The RFC version would prohibit fallbacks entirely, and make
  pageblock reclaim and compaction mandatory for all allocation
  contexts. This opens up a large dependency graph: compaction,
  possibly remaining sources of pollution, and the handling of
  low-memory situations, OOMs and deadlocks.

  This version uses only kswapd & kcompactd to pre-produce pageblocks,
  while still allowing last-ditch fallbacks to avoid memory deadlocks.

  The long-term goal remains converging on the version proposed in the
  RFC and its ~100% THP success rate, but that is reserved for future
  iterations that can build on the changes proposed here.

- The RFC version proposed a new MIGRATE_FREE type as well as
  per-migratetype counters. This allowed making compaction more
  efficient, and the pre-compaction gap checks more precise, but again
  at the cost of complex changes to an already invasive series.

  This series simply uses a new vmstat counter to track the number of
  free pages in whole blocks, and bases reclaim/compaction goals on it.

- The behavior is opt-in and can be toggled at runtime. The risk of
  regressions with any allocator change is sizable, and while many
  users care about huge pages, obviously not all do. A runtime knob is
  warranted to make the behavior optional and provide an escape hatch.

Based on today's akpm/mm-unstable.

Patches #1 and #2 are somewhat unrelated cleanups, but they touch the
same code and so are included here to avoid conflicts from re-ordering.

 Documentation/admin-guide/sysctl/vm.rst |  9 ++++
 include/linux/compaction.h              |  5 +-
 include/linux/mmzone.h                  |  1 +
 mm/compaction.c                         | 87 ++++++++++++++++++++-----------
 mm/internal.h                           |  1 +
 mm/page_alloc.c                         | 72 +++++++++++++++++++++----
 mm/vmscan.c                             | 41 ++++++++++-----
 mm/vmstat.c                             |  1 +
 8 files changed, 161 insertions(+), 56 deletions(-)