Message-ID: <2be046e1-ef95-4244-ae23-e56071ae1218@nvidia.com>
Date: Mon, 4 Dec 2023 19:37:34 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
"Yin Fengwei" <fengwei.yin@...el.com>,
David Hildenbrand <david@...hat.com>,
"Yu Zhao" <yuzhao@...gle.com>,
Catalin Marinas <catalin.marinas@....com>,
"Anshuman Khandual" <anshuman.khandual@....com>,
Yang Shi <shy828301@...il.com>,
"Huang, Ying" <ying.huang@...el.com>, Zi Yan <ziy@...dia.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Itaru Kitayama <itaru.kitayama@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Barry Song <21cnbao@...il.com>,
Alistair Popple <apopple@...dia.com>
CC: <linux-mm@...ck.org>, <linux-arm-kernel@...ts.infradead.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory
On 12/4/23 02:20, Ryan Roberts wrote:
> Hi All,
>
> A new week, a new version, a new name... This is v8 of a series to implement
> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
> this fares better.
>
> The objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
>
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This version changes the name and tidies up some of the kernel code and test
> code, based on feedback against v7 (see change log for details).
Using a couple of Armv8 systems, I've tested this patchset. I applied it
to top of tree (Linux 6.7-rc4), on top of your latest contig pte series
[1].
With those two patchsets applied, the mm selftests look OK--or at least
as OK as they normally do. I compared test runs between THP/mTHP set to
"always", vs "never", to verify that there were no new test failures.
Details: specifically, I set one particular page size (2 MB) to
"inherit", and then toggled /sys/kernel/mm/transparent_hugepage/enabled
between "always" and "never".
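
A minimal sketch of that toggling, for illustration only. The helper names
are hypothetical, the per-size directory name (hugepages-2048kB) assumes the
sysfs layout introduced by patch 3 of this series, and writing to the real
files needs root on a kernel with the series applied:

```python
from pathlib import Path
from typing import Iterator, Optional

# Real sysfs root; passed as a parameter so the sketch can be pointed at a
# scratch directory instead.
THP_ROOT = Path("/sys/kernel/mm/transparent_hugepage")

# Values accepted by the top-level "enabled" file.
GLOBAL_VALUES = {"always", "madvise", "never"}
# Per-size files additionally accept "inherit" (assumed from this series).
PER_SIZE_VALUES = GLOBAL_VALUES | {"inherit"}


def enabled_path(root: Path, size_kb: Optional[int] = None) -> Path:
    """Return the top-level or per-size "enabled" control file."""
    if size_kb is None:
        return root / "enabled"
    return root / f"hugepages-{size_kb}kB" / "enabled"


def write_enabled(root: Path, value: str, size_kb: Optional[int] = None) -> None:
    """Write one setting, rejecting values the target file would not accept."""
    allowed = GLOBAL_VALUES if size_kb is None else PER_SIZE_VALUES
    if value not in allowed:
        raise ValueError(f"{value!r} is not valid for {enabled_path(root, size_kb)}")
    enabled_path(root, size_kb).write_text(value)


def toggle_for_test_runs(root: Path = THP_ROOT, size_kb: int = 2048) -> Iterator[str]:
    """Set one size to "inherit", then yield each top-level setting in turn."""
    write_enabled(root, "inherit", size_kb)
    for setting in ("always", "never"):
        write_enabled(root, setting)
        yield setting  # caller runs the selftests here and records results
```

A caller would loop over toggle_for_test_runs(), run the mm selftests inside
the loop body, and compare the failure lists from the two passes.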
I also re-ran my usual compute/AI benchmark, and I'm still seeing the
same 10x performance improvement that I reported for the v6 patchset.
So for this patchset and for [1] as well, please feel free to add:
Tested-by: John Hubbard <jhubbard@...dia.com>
[1] https://lore.kernel.org/all/20231204105440.61448-1-ryan.roberts@arm.com/
thanks,
--
John Hubbard
NVIDIA
>
> By default, the existing behaviour (and performance) is maintained. The user
> must explicitly enable multi-size THP to see the performance benefit. This is
> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
> David for the suggestion)! This interface is inspired by the existing
> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
> compatibility with the existing PMD-size THP interface, and provides a base for
> future extensibility. See [8] for detailed discussion of the interface.
>
> This series is based on mm-unstable (715b67adf4c8).
>
>
> Prerequisites
> =============
>
> Some work items identified as being prerequisites are listed on page 3 at [9].
> The summary is:
>
> | item | status |
> |:------------------------------|:------------------------|
> | mlock | In mainline (v6.7) |
> | madvise | In mainline (v6.6) |
> | compaction | v1 posted [10] |
> | numa balancing | Investigated: see below |
> | user-triggered page migration | In mainline (v6.7) |
> | khugepaged collapse | In mainline (NOP) |
>
> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
> John Hubbard has investigated this and concluded that A) it is not clear at the
> moment what a better policy might be for PTE-mapped THP, and B) it is
> questionable whether this should really be considered a prerequisite, given
> that no regression is caused for the default "multi-size THP disabled" case and
> there is no correctness issue when it is enabled - it's just a potential for
> non-optimal performance.
>
> If there are no disagreements about removing numa balancing from the list (none
> were raised when I first posted this comment against v7), then that just leaves
> compaction which is in review on list at the moment.
>
> I really would like to get this series (and its remaining compaction
> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
> point, but let's see where we get to with review?
>
>
> Testing
> =======
>
> The series includes patches for mm selftests to enlighten the cow and khugepaged
> tests to explicitly test with multi-size THP, in the same way that PMD-sized
> THP is tested. The new tests all pass, and no regressions are observed in the mm
> selftest suite. I've also run my usual kernel compilation and JavaScript
> benchmarks without any issues.
>
> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
> THP only - they do not include the arm64 contpte follow-on series).
>
> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
> some workloads at [11]. (Observed using v6 of this series as well as the arm64
> contpte series).
>
> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
> there are some latency regressions also.
>
>
> Changes since v7 [7]
> ====================
>
> - Renamed "small-sized THP" -> "multi-size THP" in commit logs
> - Added various Reviewed-by/Tested-by tags (Barry, David, Alistair)
> - Patch 3:
> - Fine-tuned transhuge documentation for multi-size THP (JohnH)
> - Converted hugepage_global_enabled() and hugepage_global_always() macros
> to static inline functions (JohnH)
> - Renamed hugepage_vma_check() to thp_vma_allowable_orders() (JohnH)
> - Renamed transhuge_vma_suitable() to thp_vma_suitable_orders() (JohnH)
> - Renamed "global" enabled sysfs file option to "inherit" (JohnH)
> - Patch 9:
> - cow selftest: Renamed param size -> thpsize (David)
> - cow selftest: Changed test fail to assert() (David)
> - cow selftest: Log PMD size separately from all the supported THP sizes
> (David)
> - Patch 10:
> - cow selftest: No longer special case pmdsize; keep all THP sizes in
> thpsizes[]
>
>
> Changes since v6 [6]
> ====================
>
> - Refactored vmf_pte_range_changed() to remove uffd special-case (suggested by
> JohnH)
> - Dropped accounting patch (#3 in v6) (suggested by DavidH)
> - Continue to account *PMD-sized* THP only for now
> - Can add more counters in future if needed
> - Page cache large folios haven't needed any new counters yet
> - Pivot to sysfs ABI proposed by DavidH
> - per-size directories in a similar shape to that used by hugetlb
> - Dropped "recommend" keyword patch (#6 in v6) (suggested by DavidH, Yu Zhao)
> - For now, users need to understand implicitly which sizes are beneficial
> to their HW/SW
> - Dropped arch_wants_pte_order() patch (#7 in v6)
> - No longer needed due to dropping the "recommend" keyword patch
> - Enlightened khugepaged mm selftest to explicitly test with small-size THP
> - Scrubbed commit logs to use "small-sized THP" consistently (suggested by
> DavidH)
>
>
> Changes since v5 [5]
> ====================
>
> - Added accounting for PTE-mapped THPs (patch 3)
> - Added runtime control mechanism via sysfs as extension to THP (patch 4)
> - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
> - Stripped out hardcoded policy for allocation order; it's now all user space
> controlled (although user space can request "recommend" which will configure
> the HW-preferred order)
>
>
> Changes since v4 [4]
> ====================
>
> - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
> now uses the default order-3 size. I have moved this patch over to
> the contpte series.
> - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
> into series. I originally removed this at v2 to add to a separate series,
> but that series has transformed significantly and it no longer fits, so
> bringing it back here.
> - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
> set_ptes() is in mm-unstable now.
> - Updated policy for when to allocate LAF; only fallback to order-0 if
> MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
> sysfs's never/madvise/always knob.
> - Fallback to order-0 whenever uffd is armed for the vma, not just when
> uffd-wp is set on the pte.
> - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
> with ERR_PTR().
>
> The last 3 changes were proposed by Yu Zhao - thanks!
>
>
> Changes since v3 [3]
> ====================
>
> - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
> - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
> sysctl is preferable but we will wait until real workload needs it.
> - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
> - Added mm selftests for large anon folios in cow test suite.
>
>
> Changes since v2 [2]
> ====================
>
> - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
> - Huang, Ying suggested the "batch zap" work (which I dropped from this
> series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
> moved the deferred split patch to a separate series along with the batch
> zap changes. I plan to submit this series early next week.
> - Changed folio order fallback policy
> - We no longer iterate from preferred to 0 looking for acceptable policy
> - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
> - Removed vma parameter from arch_wants_pte_order()
> - Added command line parameter `flexthp_unhinted_max`
> - clamps preferred order when vma hasn't explicitly opted-in to THP
> - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
> for process or system).
> - Simplified implementation and integration with do_anonymous_page()
> - Removed dependency on set_ptes()
>
>
> Changes since v1 [1]
> ====================
>
> - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
> - replaced with arch-independent alloc_anon_folio()
> - follows THP allocation approach
> - no longer retry with intermediate orders if allocation fails
> - fallback directly to order-0
> - remove folio_add_new_anon_rmap_range() patch
> - instead add its new functionality to folio_add_new_anon_rmap()
> - remove batch-zap pte mappings optimization patch
> - remove enabler folio_remove_rmap_range() patch too
> - These offer real perf improvement so will submit separately
> - simplify Kconfig
> - single FLEXIBLE_THP option, which is independent of arch
> - depends on TRANSPARENT_HUGEPAGE
> - when enabled default to max anon folio size of 64K unless arch
> explicitly overrides
> - simplify changes to do_anonymous_page():
> - no more retry loop
>
>
> [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20230703135330.1865927-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230714160407.4142030-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/linux-mm/20230726095146.2826796-1-ryan.roberts@arm.com/
> [5] https://lore.kernel.org/linux-mm/20230810142942.3169679-1-ryan.roberts@arm.com/
> [6] https://lore.kernel.org/linux-mm/20230929114421.3761121-1-ryan.roberts@arm.com/
> [7] https://lore.kernel.org/linux-mm/20231122162950.3854897-1-ryan.roberts@arm.com/
> [8] https://lore.kernel.org/linux-mm/6d89fdc9-ef55-d44e-bf12-fafff318aef8@redhat.com/
> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> [10] https://lore.kernel.org/linux-mm/20231113170157.280181-1-zi.yan@sent.com/
> [11] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
> [12] https://lore.kernel.org/linux-mm/479b3e2b-456d-46c1-9677-38f6c95a0be8@huawei.com/
>
>
> Thanks,
> Ryan
>
> Ryan Roberts (10):
> mm: Allow deferred splitting of arbitrary anon large folios
> mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
> mm: thp: Introduce multi-size THP sysfs interface
> mm: thp: Support allocation of anonymous multi-size THP
> selftests/mm/kugepaged: Restore thp settings at exit
> selftests/mm: Factor out thp settings management
> selftests/mm: Support multi-size THP interface in thp_settings
> selftests/mm/khugepaged: Enlighten for multi-size THP
> selftests/mm/cow: Generalize do_run_with_thp() helper
> selftests/mm/cow: Add tests for anonymous multi-size THP
>
> Documentation/admin-guide/mm/transhuge.rst | 97 ++++-
> Documentation/filesystems/proc.rst | 6 +-
> fs/proc/task_mmu.c | 3 +-
> include/linux/huge_mm.h | 116 ++++--
> mm/huge_memory.c | 268 ++++++++++++--
> mm/khugepaged.c | 20 +-
> mm/memory.c | 114 +++++-
> mm/page_vma_mapped.c | 3 +-
> mm/rmap.c | 32 +-
> tools/testing/selftests/mm/Makefile | 4 +-
> tools/testing/selftests/mm/cow.c | 185 +++++++---
> tools/testing/selftests/mm/khugepaged.c | 410 ++++-----------------
> tools/testing/selftests/mm/run_vmtests.sh | 2 +
> tools/testing/selftests/mm/thp_settings.c | 349 ++++++++++++++++++
> tools/testing/selftests/mm/thp_settings.h | 80 ++++
> 15 files changed, 1177 insertions(+), 512 deletions(-)
> create mode 100644 tools/testing/selftests/mm/thp_settings.c
> create mode 100644 tools/testing/selftests/mm/thp_settings.h
>
> --
> 2.25.1
>