lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250313210647.1314586-1-hannes@cmpxchg.org>
Date: Thu, 13 Mar 2025 17:05:31 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>,
	Mel Gorman <mgorman@...hsingularity.net>,
	Zi Yan <ziy@...dia.com>,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH 0/5] mm: reliable huge page allocator

This series makes changes to the allocator and reclaim/compaction code
to try harder to avoid fragmentation. As a result, this makes huge
page allocations cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

  https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

                                                     before                   after
    Hugealloc Time mean               52739.45 (    +0.00%)   28904.00 (   -45.19%)
    Hugealloc Time stddev             56541.26 (    +0.00%)   33464.37 (   -40.81%)
    Kbuild Real time                    197.47 (    +0.00%)     196.59 (    -0.44%)
    Kbuild User time                   1240.49 (    +0.00%)    1231.67 (    -0.71%)
    Kbuild System time                   70.08 (    +0.00%)      59.10 (   -15.45%)
    THP fault alloc                   46727.07 (    +0.00%)   63223.67 (   +35.30%)
    THP fault fallback                21910.60 (    +0.00%)    5412.47 (   -75.29%)
    Direct compact fail                 195.80 (    +0.00%)      59.07 (   -69.48%)
    Direct compact success                7.93 (    +0.00%)       2.80 (   -57.46%)
    Direct compact success rate %         3.51 (    +0.00%)       3.99 (   +10.49%)
    Compact daemon scanned migrate  3369601.27 (    +0.00%) 2267500.33 (   -32.71%)
    Compact daemon scanned free     5075474.47 (    +0.00%) 2339773.00 (   -53.90%)
    Compact direct scanned migrate   161787.27 (    +0.00%)   47659.93 (   -70.54%)
    Compact direct scanned free      163467.53 (    +0.00%)   40729.67 (   -75.08%)
    Compact total migrate scanned   3531388.53 (    +0.00%) 2315160.27 (   -34.44%)
    Compact total free scanned      5238942.00 (    +0.00%) 2380502.67 (   -54.56%)
    Alloc stall                        2371.07 (    +0.00%)     638.87 (   -73.02%)
    Pages kswapd scanned            2160926.73 (    +0.00%) 4002186.33 (   +85.21%)
    Pages kswapd reclaimed           533191.07 (    +0.00%)  718577.80 (   +34.77%)
    Pages direct scanned             400450.33 (    +0.00%)  355172.73 (   -11.31%)
    Pages direct reclaimed            94441.73 (    +0.00%)   31162.80 (   -67.00%)
    Pages total scanned             2561377.07 (    +0.00%) 4357359.07 (   +70.12%)
    Pages total reclaimed            627632.80 (    +0.00%)  749740.60 (   +19.46%)
    Swap out                          47959.53 (    +0.00%)  110084.33 (  +129.53%)
    Swap in                            7276.00 (    +0.00%)   24457.00 (  +236.10%)
    File refaults                    138043.00 (    +0.00%)  188226.93 (   +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%. These
metrics also hold up over time, while the vanilla kernel sees a steady
downward trend in success rates with each subsequent run, owed to the
cumulative effects of fragmentation.

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces
the existing ALLOC_NOFRAGMENT alloc flag until after reclaim and
compaction have run. They then change kswapd and kcompactd to target
pageblocks, which boosts success in the ALLOC_NOFRAGMENT hotpaths.

Main differences to the RFC:

- The freelist hygiene patches have since been upstreamed separately.

- The RFC version would prohibit fallbacks entirely, and make
  pageblock reclaim and compaction mandatory for all allocation
  contexts. This opens up a large dependency graph for compaction,
  possibly remaining sources of pollution, and the handling of
  low-memory situations, OOMs and deadlocks.

  This version uses only kswapd & kcompactd to pre-produce pageblocks,
  while still allowing last-ditch fallbacks to avoid memory deadlocks.

  The long-term goal remains converging on the version proposed in the
  RFC and its ~100% THP success rate. But this is reserved for future
  iterations that can build on the changes proposed here.

- The RFC version proposed a new MIGRATE_FREE type as well as
  per-migratetype counters. This allowed making compaction more
  efficient, and the pre-compaction gap checks more precise, but again
  at the cost of complex changes in an already invasive series.

  This series simply uses a new vmstat counter to track the number of
  free pages in whole blocks to base reclaim/compaction goals on.

- The behavior is opt-in and can be toggled at runtime. The risk for
  regressions with any allocator change is sizable, and while many
  users care about huge pages, obviously not all do. A runtime knob is
  warranted to make the behavior optional and provide an escape hatch.

Based on today's akpm/mm-unstable.

Patches #1 and #2 are somewhat unrelated cleanups, but touch the same
code and so included here to avoid conflicts from re-ordering.

 Documentation/admin-guide/sysctl/vm.rst |  9 ++++
 include/linux/compaction.h              |  5 +-
 include/linux/mmzone.h                  |  1 +
 mm/compaction.c                         | 87 ++++++++++++++++++++-----------
 mm/internal.h                           |  1 +
 mm/page_alloc.c                         | 72 +++++++++++++++++++++----
 mm/vmscan.c                             | 41 ++++++++++-----
 mm/vmstat.c                             |  1 +
 8 files changed, 161 insertions(+), 56 deletions(-)


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ