Message-ID: <20240214204435.167852-1-david@redhat.com>
Date: Wed, 14 Feb 2024 21:44:25 +0100
From: David Hildenbrand <david@...hat.com>
To: linux-kernel@...r.kernel.org
Cc: linux-mm@...ck.org,
	David Hildenbrand <david@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Matthew Wilcox <willy@...radead.org>,
	Ryan Roberts <ryan.roberts@....com>,
	Catalin Marinas <catalin.marinas@....com>,
	Yin Fengwei <fengwei.yin@...el.com>,
	Michal Hocko <mhocko@...e.com>,
	Will Deacon <will@...nel.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
	Nick Piggin <npiggin@...il.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Michael Ellerman <mpe@...erman.id.au>,
	Christophe Leroy <christophe.leroy@...roup.eu>,
	"Naveen N. Rao" <naveen.n.rao@...ux.ibm.com>,
	Heiko Carstens <hca@...ux.ibm.com>,
	Vasily Gorbik <gor@...ux.ibm.com>,
	Alexander Gordeev <agordeev@...ux.ibm.com>,
	Christian Borntraeger <borntraeger@...ux.ibm.com>,
	Sven Schnelle <svens@...ux.ibm.com>,
	Arnd Bergmann <arnd@...db.de>,
	linux-arch@...r.kernel.org,
	linuxppc-dev@...ts.ozlabs.org,
	linux-s390@...r.kernel.org
Subject: [PATCH v3 00/10] mm/memory: optimize unmap/zap with PTE-mapped THP

This series is based on [1]. Similar to what we did with fork(), let's
implement PTE batching during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.
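
To illustrate the batching idea, here is a minimal userspace sketch of
how a batch of consecutive PTEs could be detected. It is not the code
from this series: struct sketch_pte, its fields and sketch_pte_batch()
are hypothetical names made up for this example, and real PTEs are
architecture-specific values rather than a plain struct.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PTE: a page frame number plus attribute bits. */
struct sketch_pte {
	uint64_t pfn;
	uint32_t flags;
	bool present;
};

/*
 * Return how many entries, starting at ptes[0], could be processed as one
 * batch: all present, mapping consecutive PFNs, carrying identical flags,
 * and not crossing the (exclusive) end PFN of the folio. At most max_nr
 * entries are inspected. Returns 0 if the first entry does not qualify.
 */
static size_t sketch_pte_batch(const struct sketch_pte *ptes, size_t max_nr,
			       uint64_t folio_end_pfn)
{
	size_t nr;

	if (!max_nr || !ptes[0].present || ptes[0].pfn >= folio_end_pfn)
		return 0;

	for (nr = 1; nr < max_nr; nr++) {
		if (!ptes[nr].present)
			break;
		/* Must map the page directly following the previous one. */
		if (ptes[nr].pfn != ptes[0].pfn + nr)
			break;
		/* Must still belong to the same folio. */
		if (ptes[nr].pfn >= folio_end_pfn)
			break;
		/* The other PTE bits must be compatible (here: identical). */
		if (ptes[nr].flags != ptes[0].flags)
			break;
	}
	return nr;
}
```

A caller would then do the refcount, rmap, PTE clearing and TLB work once
for the returned number of entries instead of once per entry.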

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.
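
As a rough illustration of what "consuming" such an encoded array could
look like, here is a small userspace sketch. It assumes that flag bits
are stored in the low bits of an aligned pointer and that one flag marks
"the next array entry holds a page count instead of a pointer". All
SKETCH_* and sketch_* names are hypothetical and only serve this
example; the actual encoding is defined by the patches in this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags kept in the two low bits of an aligned pointer. */
#define SKETCH_FLAG_DELAY_RMAP		((uintptr_t)1 << 0)
#define SKETCH_FLAG_NR_PAGES_NEXT	((uintptr_t)1 << 1)
#define SKETCH_FLAG_MASK		((uintptr_t)3)
/* Page counts are shifted so they never collide with the flag bits. */
#define SKETCH_NR_SHIFT			2

/* Pack flags into the low bits; p must be at least 4-byte aligned. */
static uintptr_t sketch_encode_ptr(const void *p, uintptr_t flags)
{
	return (uintptr_t)p | (flags & SKETCH_FLAG_MASK);
}

/* Encode a page count as a standalone array entry. */
static uintptr_t sketch_encode_nr(size_t nr)
{
	return (uintptr_t)nr << SKETCH_NR_SHIFT;
}

/*
 * Walk the array and sum up the pages it describes. A plain entry stands
 * for a single page, so the order-0 case needs no extra array slot. An
 * entry with SKETCH_FLAG_NR_PAGES_NEXT is followed by its page count.
 */
static size_t sketch_count_pages(const uintptr_t *entries, size_t n)
{
	size_t i, total = 0;

	for (i = 0; i < n; i++) {
		if (entries[i] & SKETCH_FLAG_NR_PAGES_NEXT) {
			if (i + 1 >= n)
				break;	/* malformed: count entry missing */
			total += entries[++i] >> SKETCH_NR_SHIFT;
		} else {
			total += 1;
		}
	}
	return total;
}
```

In this sketch, single pages cost exactly one entry as before, which is
one way to keep the order-0 path unchanged.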

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |      New | Change
---------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%

CCing especially s390x folks, because they have a TLB freeing hook that
needs adjustment. Only tested on x86-64 for now, will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negligible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com

---

Sending this out earlier than I usually would, so we can get this into
mm-unstable for Ryan to base his cont-pte work on this ASAP.

The performance numbers are from v1. I did a quick benchmark run of v3
and nothing changed significantly; the relevant code paths remained
unchanged.

v2 -> v3:
* "mm/mmu_gather: add __tlb_remove_folio_pages()"
 -> Slightly adjusted patch description
* "mm/mmu_gather: improve cond_resched() handling with large folios and
   expensive page freeing"
 -> Use new macro for magic value and avoid code duplication
 -> Extend patch description
* Pick up RB's

v1 -> v2:
* "mm/memory: factor out zapping of present pte into zap_present_pte()"
 -> Initialize "struct folio *folio" to NULL
* "mm/memory: handle !page case in zap_present_pte() separately"
 -> Extend description regarding arch_check_zapped_pte()
* "mm/mmu_gather: add __tlb_remove_folio_pages()"
 -> ENCODED_PAGE_BIT_NR_PAGES_NEXT
 -> Extend patch description regarding "batching more"
* "mm/mmu_gather: improve cond_resched() handling with large folios and
   expensive page freeing"
 -> Handle the (so far) theoretical case of possible soft lockups when
    we zero/poison memory when freeing pages. Try to keep old behavior in
    that corner case to be safe.
* "mm/memory: optimize unmap/zap with PTE-mapped THP"
 -> Clarify description of new ptep clearing functions regarding "present
    PTEs"
 -> Extend patch description regarding relaxed mapcount sanity checks
 -> Improve zap_present_ptes() description
* Pick up RB's

Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@...radead.org>
Cc: Ryan Roberts <ryan.roberts@....com>
Cc: Catalin Marinas <catalin.marinas@....com>
Cc: Yin Fengwei <fengwei.yin@...el.com>
Cc: Michal Hocko <mhocko@...e.com>
Cc: Will Deacon <will@...nel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>
Cc: Nick Piggin <npiggin@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Michael Ellerman <mpe@...erman.id.au>
Cc: Christophe Leroy <christophe.leroy@...roup.eu>
Cc: "Naveen N. Rao" <naveen.n.rao@...ux.ibm.com>
Cc: Heiko Carstens <hca@...ux.ibm.com>
Cc: Vasily Gorbik <gor@...ux.ibm.com>
Cc: Alexander Gordeev <agordeev@...ux.ibm.com>
Cc: Christian Borntraeger <borntraeger@...ux.ibm.com>
Cc: Sven Schnelle <svens@...ux.ibm.com>
Cc: Arnd Bergmann <arnd@...db.de>
Cc: linux-arch@...r.kernel.org
Cc: linuxppc-dev@...ts.ozlabs.org
Cc: linux-s390@...r.kernel.org

David Hildenbrand (10):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in
    zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to
    __tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: improve cond_resched() handling with large folios and
    expensive page freeing
  mm/memory: optimize unmap/zap with PTE-mapped THP

 arch/powerpc/include/asm/tlb.h |   2 +
 arch/s390/include/asm/tlb.h    |  30 ++++--
 include/asm-generic/tlb.h      |  40 ++++++--
 include/linux/mm_types.h       |  37 ++++++--
 include/linux/pgtable.h        |  70 ++++++++++++++
 mm/memory.c                    | 169 +++++++++++++++++++++++----------
 mm/mmu_gather.c                | 111 ++++++++++++++++++----
 mm/swap.c                      |  12 ++-
 mm/swap_state.c                |  15 ++-
 9 files changed, 393 insertions(+), 93 deletions(-)


base-commit: 7e56cf9a7f108e8129d75cea0dabc9488fb4defa
-- 
2.43.0

