Message-ID: <ea345a14-0a39-425c-a2df-d163ca948f57@nvidia.com>
Date: Mon, 4 Dec 2023 19:41:02 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Ryan Roberts <ryan.roberts@....com>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>,
Ard Biesheuvel <ardb@...nel.org>,
Marc Zyngier <maz@...nel.org>,
Oliver Upton <oliver.upton@...ux.dev>,
James Morse <james.morse@....com>,
Suzuki K Poulose <suzuki.poulose@....com>,
Zenghui Yu <yuzenghui@...wei.com>,
Andrey Ryabinin <ryabinin.a.a@...il.com>,
Alexander Potapenko <glider@...gle.com>,
"Andrey Konovalov" <andreyknvl@...il.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Vincenzo Frascino <vincenzo.frascino@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Anshuman Khandual <anshuman.khandual@....com>,
Matthew Wilcox <willy@...radead.org>,
Yu Zhao <yuzhao@...gle.com>,
"Mark Rutland" <mark.rutland@....com>,
David Hildenbrand <david@...hat.com>,
"Kefeng Wang" <wangkefeng.wang@...wei.com>,
Zi Yan <ziy@...dia.com>, Barry Song <21cnbao@...il.com>,
Alistair Popple <apopple@...dia.com>,
Yang Shi <shy828301@...il.com>
CC: <linux-arm-kernel@...ts.infradead.org>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 00/15] Transparent Contiguous PTEs for User Mappings
On 12/4/23 02:54, Ryan Roberts wrote:
> Hi All,
>
> This is v3 of a series to opportunistically and transparently use contpte
> mappings (set the contiguous bit in ptes) for user memory when those mappings
> meet the requirements. It is part of a wider effort to improve performance by
> allocating and mapping variable-sized blocks of memory (folios). One aim is for
> the 4K kernel to approach the performance of the 16K kernel, but without
> breaking compatibility and without the associated increase in memory. Another
> aim is to benefit the 16K and 64K kernels by enabling 2M THP, since this is the
> contpte size for those kernels. We have good performance data that demonstrates
> both aims are being met (see below).
>
> Of course this is only one half of the change. We require the mapped physical
> memory to be the correct size and alignment for this to actually be useful (i.e.
> 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
> problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
> allocate large folios up to the PMD size today, and more filesystems are coming.
> And the other half of my work, to enable "multi-size THP" (large folios) for
> anonymous memory, makes contpte sized folios prevalent for anonymous memory too
> [3].
>
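The size requirement quoted above (64K for 4K pages, 2M for 16K/64K pages) follows from how many adjacent PTEs one contiguous-bit block spans on each arm64 granule. The sketch below is only an illustration of that arithmetic: the function names are hypothetical and are not taken from the series, and the alignment check is a simplified assumption about what "correct size and alignment" means.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of adjacent PTEs covered by one contiguous-bit block for each
 * arm64 translation granule. Function names are hypothetical.
 */
static size_t contpte_entries(size_t page_size)
{
	switch (page_size) {
	case 4096:  return 16;   /* 4K granule:  16 x 4K  = 64K */
	case 16384: return 128;  /* 16K granule: 128 x 16K = 2M */
	case 65536: return 32;   /* 64K granule: 32 x 64K = 2M  */
	default:    return 0;    /* not an arm64 granule size   */
	}
}

/* Bytes mapped by one contpte block: entries times page size. */
static size_t contpte_block_size(size_t page_size)
{
	return contpte_entries(page_size) * page_size;
}

/*
 * Simplified assumption: a range can only be contpte-mapped if both the
 * virtual and the physical start address are aligned to the block size.
 */
static int contpte_aligned(unsigned long long vaddr,
			   unsigned long long paddr, size_t page_size)
{
	size_t block = contpte_block_size(page_size);

	assert(block != 0);
	return ((vaddr | paddr) & (block - 1)) == 0;
}
```

For example, contpte_block_size(4096) evaluates to 65536 bytes, matching the 64K figure for the 4K kernel.
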
Hi Ryan,

Using a couple of Armv8 systems, I've tested this patchset. Details are in my
reply to the mTHP patchset [1].

So for this patchset, please feel free to add:

Tested-by: John Hubbard <jhubbard@...dia.com>

[1] https://lore.kernel.org/all/2be046e1-ef95-4244-ae23-e56071ae1218@nvidia.com/

thanks,
--
John Hubbard
NVIDIA

> Optimistically, I would really like to get this series merged for v6.8; there is
> a chance that the multi-size THP series will also get merged for that version
> (although at this point pretty small). But even if it doesn't, this series still
> benefits file-backed memory from the file systems that support large folios so
> shouldn't be held up for it. Additionally I've got data that shows this series
> adds no regression when the system has no appropriate large folios.
>
> All dependencies listed against v1 are now resolved. This series applies cleanly
> against v6.7-rc1.
>
> Note that the first two patches are for core-mm and provide the refactoring to
> make some crucial optimizations possible - which are then implemented in patches
> 14 and 15. The remaining patches are arm64-specific.
>
> Testing
> =======
>
> I've tested this series together with multi-size THP [3] on both Ampere Altra
> (bare metal) and Apple M2 (VM):
> - mm selftests (inc new tests written for multi-size THP); no regressions
> - Speedometer JavaScript benchmark in Chromium web browser; no issues
> - Kernel compilation; no issues
> - Various tests under high memory pressure with swap enabled; no issues
>
>
> Performance
> ===========
>
> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
> some workloads at [4], when using 64K base page kernel.
>
> You can also see the original performance results I posted against v1 [1] which
> are still valid.
>
> I've additionally run the kernel compilation and speedometer benchmarks on a
> system with multi-size THP disabled and large folio support for file-backed
> memory intentionally disabled; I see no change in performance in this case (i.e.
> no regression when this change is "present but not useful").
>
>
> Changes since v2 [2]
> ====================
>
> - Removed contpte_ptep_get_and_clear_full() optimisation for exit() (v2#14),
>   and replaced with a batch-clearing approach using a new arch helper,
>   clear_ptes() (v3#2 and v3#15) (Alistair and Barry)
> - (v2#1 / v3#1)
>     - Fixed folio refcounting so that refcount >= mapcount always (DavidH)
>     - Reworked batch demarcation to avoid pte_pgprot() (DavidH)
>     - Reverted return semantic of copy_present_page() and instead fix it up in
>       copy_present_ptes() (Alistair)
>     - Removed page_cont_mapped_vaddr() and replaced with simpler logic
>       (Alistair)
>     - Made batch accounting clearer in copy_pte_range() (Alistair)
> - (v2#12 / v3#13)
>     - Renamed contpte_fold() -> contpte_convert() and hoisted setting/
>       clearing CONT_PTE bit to higher level (Alistair)
>
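The batching idea behind the fork() and zap changes can be pictured with a small userspace model. This is a sketch under stated assumptions, not the series' actual demarcation logic: struct model_pte and model_batch_len() are hypothetical names, and a "batch" is simplified to a run of present entries whose page frame numbers increase by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a page table entry: a frame number and a present flag. */
struct model_pte {
	uint64_t pfn;
	int present;
};

/*
 * Return how many entries, starting at ptes[0], form one batch: all present
 * and mapping physically consecutive frames. Returns 0 if the first entry is
 * not present or max is 0.
 */
static size_t model_batch_len(const struct model_pte *ptes, size_t max)
{
	size_t n;

	if (max == 0 || !ptes[0].present)
		return 0;

	for (n = 1; n < max; n++) {
		if (!ptes[n].present || ptes[n].pfn != ptes[0].pfn + n)
			break;
	}
	return n;
}
```

A caller in this model would process model_batch_len() entries in one step and then advance by that many entries, rather than handling each entry separately.
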
>
> Changes since v1 [1]
> ====================
>
> - Export contpte_* symbols so that modules can continue to call inline
> functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
> to JohnH)
> - Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
> - Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
> (thanks to Catalin)
> - Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
> to Catalin)
> - Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
> - Simplified contpte_ptep_get_and_clear_full()
> - Improved various code comments
>
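The pte_valid_cont() factoring mentioned in the v1 changelog can be illustrated outside the kernel. The following is a minimal userspace model only: the model_* names and the plain 64-bit type are hypothetical stand-ins, and the bit positions assume the arm64 descriptor layout (valid at bit 0, contiguous hint at bit 52).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pte_t: a raw 64-bit descriptor. */
typedef uint64_t model_pte_t;

#define MODEL_PTE_VALID (UINT64_C(1) << 0)   /* descriptor valid bit */
#define MODEL_PTE_CONT  (UINT64_C(1) << 52)  /* contiguous hint bit  */

static bool model_pte_valid(model_pte_t pte)
{
	return (pte & MODEL_PTE_VALID) != 0;
}

static bool model_pte_cont(model_pte_t pte)
{
	return (pte & MODEL_PTE_CONT) != 0;
}

/* Combined check, so callers do not repeat (valid && cont) at every site. */
static bool model_pte_valid_cont(model_pte_t pte)
{
	return model_pte_valid(pte) && model_pte_cont(pte);
}
```

In this model, an entry with the contiguous bit set but the valid bit clear is not treated as part of a contpte block.
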
>
> [1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-arm-kernel/20231204102027.57185-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
>
>
> Thanks,
> Ryan
>
> Ryan Roberts (15):
> mm: Batch-copy PTE ranges during fork()
> mm: Batch-clear PTE ranges during zap_pte_range()
> arm64/mm: set_pte(): New layer to manage contig bit
> arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> arm64/mm: pte_clear(): New layer to manage contig bit
> arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> arm64/mm: ptep_get(): New layer to manage contig bit
> arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> arm64/mm: Wire up PTE_CONT for user mappings
> arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> arm64/mm: Implement clear_ptes() to optimize exit()
>
> arch/arm64/Kconfig | 10 +-
> arch/arm64/include/asm/pgtable.h | 343 ++++++++++++++++++++---
> arch/arm64/include/asm/tlbflush.h | 13 +-
> arch/arm64/kernel/efi.c | 4 +-
> arch/arm64/kernel/mte.c | 2 +-
> arch/arm64/kvm/guest.c | 2 +-
> arch/arm64/mm/Makefile | 1 +
> arch/arm64/mm/contpte.c | 436 ++++++++++++++++++++++++++++++
> arch/arm64/mm/fault.c | 12 +-
> arch/arm64/mm/fixmap.c | 4 +-
> arch/arm64/mm/hugetlbpage.c | 40 +--
> arch/arm64/mm/kasan_init.c | 6 +-
> arch/arm64/mm/mmu.c | 16 +-
> arch/arm64/mm/pageattr.c | 6 +-
> arch/arm64/mm/trans_pgd.c | 6 +-
> include/asm-generic/tlb.h | 9 +
> include/linux/pgtable.h | 39 +++
> mm/memory.c | 258 +++++++++++++-----
> mm/mmu_gather.c | 14 +
> 19 files changed, 1067 insertions(+), 154 deletions(-)
> create mode 100644 arch/arm64/mm/contpte.c
>
> --
> 2.25.1
>