Message-ID: <20240826204353.2228736-1-peterx@redhat.com>
Date: Mon, 26 Aug 2024 16:43:34 -0400
From: Peter Xu <peterx@...hat.com>
To: linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Cc: Gavin Shan <gshan@...hat.com>,
	Catalin Marinas <catalin.marinas@....com>,
	x86@...nel.org,
	Ingo Molnar <mingo@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Alistair Popple <apopple@...dia.com>,
	kvm@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org,
	Sean Christopherson <seanjc@...gle.com>,
	peterx@...hat.com,
	Oscar Salvador <osalvador@...e.de>,
	Jason Gunthorpe <jgg@...dia.com>,
	Borislav Petkov <bp@...en8.de>,
	Zi Yan <ziy@...dia.com>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	David Hildenbrand <david@...hat.com>,
	Yan Zhao <yan.y.zhao@...el.com>,
	Will Deacon <will@...nel.org>,
	Kefeng Wang <wangkefeng.wang@...wei.com>,
	Alex Williamson <alex.williamson@...hat.com>
Subject: [PATCH v2 00/19] mm: Support huge pfnmaps

v2:
- Added tags
- Let folio_walk_start() scan special pmd/pud bits [DavidH]
- Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
- Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
  special bits for huge pfn mappings when inject" [JasonG]
- In gup-fast, reorder _special check v.s. _devmap check, so as to make
  pmd/pud path look the same as pte path [DavidH, JasonG]
- Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
  used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]

Overview
========

This series is based on mm-unstable as of Aug 26th (commit b659edec079c),
with the patch "vma remove the unneeded avc bound with non-CoWed folio"
reverted, as it was reported broken [0].

This series implements huge pfnmap support for mm in general.  Huge pfnmap
allows e.g. VM_PFNMAP vmas to be mapped at either PMD or PUD level, similar
to what we already do with dax / thp / hugetlb, to benefit from better TLB
hit rates.  Now we extend that idea to PFN mappings, e.g. PCI MMIO BARs,
which can be as large as 8GB or even bigger.

Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
patch (from Alex Williamson) will be the first user of huge pfnmap, enabling
the vfio-pci driver to fault in huge pfn mappings.

Implementation
==============

In reality, it's relatively simple to add such support compared to many
other types of mappings.  PFNMAP is special in that there is no vmemmap
backing it, so most of the kernel routines that work on huge mappings
should simply already fail for them, like GUPs or old-school follow_page()
(which was recently rewritten into the folio_walk* APIs by David).

One trick here is that PUD support is still immature in generic paths here
and there, as DAX is so far the only user.  This patchset adds the 2nd
user.  Hugetlb can be a 3rd user if the hugetlb unification work goes
smoothly, but that is to be discussed later.

The other trick is how to let gup-fast work with such huge mappings when
there is no direct way of knowing whether an entry is a normal page or an
MMIO mapping.  This series chose to keep the pte_special solution: it
reuses the same idea by setting a special bit on pfnmap PMDs/PUDs, so that
gup-fast will be able to identify them and fail properly.
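
As a rough userspace illustration of the special-bit idea (not kernel code),
the check can be modeled as below.  The entry layout, the bit positions and
every MODEL_* / model_* name are hypothetical; real special bits are defined
per architecture by the patches themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical entry layout, for illustration only.  Real bit positions
 * are architecture specific and differ from these. */
#define MODEL_PRESENT  (UINT64_C(1) << 0)
#define MODEL_HUGE     (UINT64_C(1) << 1)  /* leaf entry at PMD/PUD level */
#define MODEL_SPECIAL  (UINT64_C(1) << 2)  /* pfnmap: no struct page behind it */

typedef uint64_t model_pmd_t;

bool model_pmd_special(model_pmd_t pmd)
{
	return (pmd & MODEL_SPECIAL) != 0;
}

/* Returns true if the lockless fast path may go on and pin the page.
 * false means "bail out", so the caller falls back or fails. */
bool model_gup_fast_huge_pmd(model_pmd_t pmd)
{
	if (!(pmd & MODEL_PRESENT) || !(pmd & MODEL_HUGE))
		return false;
	/* A special entry has no page to pin, so the fast path must stop. */
	if (model_pmd_special(pmd))
		return false;
	return true;
}
```

The only point of the model is that one bit in the huge entry is enough for
the fast path to tell a pfnmap apart from a normal huge page.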

Along the way, we'll also notice that the major pgtable pfn walker, aka
follow_pte(), will need to retire soon because it only works with ptes.  A
new set of simple APIs (the follow_pfnmap* API) is introduced to do
whatever follow_pte() can already do, and it can also process huge pfnmaps.
Half of this series is about that and about converting all existing pfnmap
walkers to use the new API properly.  Hopefully the new API also looks
better by not exposing e.g. pgtable lock details to the callers, so that it
can be used in an even more straightforward way.
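
The start/end calling pattern, and why the PFN should only be consumed
before end(), can be sketched with a small userspace model.  Everything
here is hypothetical (struct model_walk_args, model_walk_start(),
model_walk_end(), and the single fake mapping standing in for a page
table); the real API is defined by the patches, not by this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4K base pages */

/* One fake mapping standing in for a page table. */
struct model_mapping {
	unsigned long start, end;	/* [start, end) virtual range */
	unsigned long base_pfn;
	bool writable;
	int lock_depth;			/* stands in for the pgtable lock */
};

struct model_walk_args {
	/* Inputs */
	struct model_mapping *map;
	unsigned long address;
	/* Outputs, only meaningful between start() and end() */
	unsigned long pfn;
	bool writable;
};

/* Returns 0 with the "lock" held, or -1 if the address is not mapped. */
int model_walk_start(struct model_walk_args *args)
{
	struct model_mapping *m = args->map;

	if (args->address < m->start || args->address >= m->end)
		return -1;
	m->lock_depth++;
	args->pfn = m->base_pfn +
		    ((args->address - m->start) >> MODEL_PAGE_SHIFT);
	args->writable = m->writable;
	return 0;
}

void model_walk_end(struct model_walk_args *args)
{
	assert(args->map->lock_depth > 0);
	args->map->lock_depth--;
}

/* Caller pattern: consult the outputs only inside the start()/end()
 * window.  The caller never sees the lock itself. */
bool model_addr_maps_pfn(struct model_mapping *m, unsigned long addr,
			 unsigned long expected_pfn)
{
	struct model_walk_args args = { .map = m, .address = addr };
	bool match;

	if (model_walk_start(&args))
		return false;
	match = (args.pfn == expected_pfn);
	model_walk_end(&args);
	return match;
}
```

In the model, as in the description above, the lock is an internal detail of
the start/end pair rather than something each caller has to manage.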

Here, three more options will be introduced and involved in huge pfnmap:

  - ARCH_SUPPORTS_HUGE_PFNMAP

    Arch developers will need to select this option in the arch's Kconfig
    when huge pfnmap is supported.  After this patchset is applied, both
    x86_64 and arm64 will enable it by default.

  - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP

    These options let driver developers identify whether the current
    arch / config supports huge pfnmaps, and so decide whether the huge
    pfnmap APIs can be used to inject them.  One can refer to the last
    vfio-pci patch from Alex for how to use them properly in a device
    driver.
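
To give a feel for how a driver could gate on these options, here is a
userspace model of a huge fault handler's order switch.  It is only a
sketch: the CONFIG_ macro is defined by hand to imitate a PMD-only config
(like arm64 in this series), and the MODEL_* orders, the enum and
model_huge_fault() are hypothetical names, not taken from the vfio-pci
patch.

```c
#include <assert.h>

/* Defined by hand to imitate a config with PMD support only.  In a kernel
 * build these would come from Kconfig instead. */
#define CONFIG_ARCH_SUPPORTS_PMD_PFNMAP 1
/* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP is intentionally left undefined. */

#define MODEL_PMD_ORDER 9	/* 2M with assumed 4K base pages */
#define MODEL_PUD_ORDER 18	/* 1G with assumed 4K base pages */

enum model_fault_result {
	MODEL_MAP_PTE,
	MODEL_MAP_PMD,
	MODEL_MAP_PUD,
	MODEL_FALLBACK,		/* let the core retry with a smaller order */
};

enum model_fault_result model_huge_fault(unsigned int order)
{
	switch (order) {
	case 0:
		return MODEL_MAP_PTE;
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
	case MODEL_PMD_ORDER:
		return MODEL_MAP_PMD;
#endif
#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
	case MODEL_PUD_ORDER:
		return MODEL_MAP_PUD;
#endif
	default:
		return MODEL_FALLBACK;
	}
}
```

With only the PMD option set, a PUD-order fault falls through to the
fallback result, so the same driver source works on both kinds of config.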

So after the whole set is applied, and if one enables some dynamic debug
lines in the vfio-pci core files, one should observe things like:

  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
  vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100

In this specific case, it says that vfio-pci faults in PMDs properly for a
few BAR0 offsets.
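
The numbers in these log lines can be cross-checked with a bit of
arithmetic.  Assuming 4K base pages (an assumption, the log itself does not
state the page size), order 9 covers 0x200 pages, i.e. 2M, which matches
the page offsets 0x0, 0x200 and 0x400 stepping by one PMD each.  The helper
names below are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4K base pages */

/* Number of base pages covered by a mapping of the given order. */
unsigned long model_order_pages(unsigned int order)
{
	return 1UL << order;
}

/* Number of bytes covered by a mapping of the given order. */
unsigned long long model_order_bytes(unsigned int order)
{
	assert(order + MODEL_PAGE_SHIFT < 64);
	return 1ULL << (order + MODEL_PAGE_SHIFT);
}

/* Whether a page offset into the BAR is aligned for the given order. */
bool model_pgoff_aligned(unsigned long pgoff, unsigned int order)
{
	return (pgoff & (model_order_pages(order) - 1)) == 0;
}
```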

Patch Layout
============

Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
Patch 2:         A tiny cleanup
Patch 3-8:       Preparation patches for huge pfnmap (including introducing
                 the special bit for pmd/pud)
Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                 then drop follow_pte() API
Patch 17:        Add huge pfnmap support for x86_64
Patch 18:        Add huge pfnmap support for arm64
Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)

TODO
====

More architectures / More page sizes
------------------------------------

Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems to
be a plan to support arm64 1G later on top of this series [2].

Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.

remap_pfn_range() support
-------------------------

Currently, remap_pfn_range() still only maps PTEs.  With the new option,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements match on the VAs.

When the support is there, it should silently benefit all drivers that use
remap_pfn_range() in their mmap() handlers, with a better TLB hit rate and
overall faster MMIO accesses, similar to what the processor gets with
hugepages.
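
To make the alignment requirement concrete, below is a userspace sketch of
how a range could be split into PUD / PMD / PTE sized pieces.  This is not
remap_pfn_range() code: the function names are hypothetical, and the sizes
assume 4K / 2M / 1G levels as on x86_64.  A larger level is picked only when
the VA, the physical address and the remaining length all allow it.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PTE_SIZE (UINT64_C(1) << 12)
#define MODEL_PMD_SIZE (UINT64_C(1) << 21)
#define MODEL_PUD_SIZE (UINT64_C(1) << 30)

/* Largest mapping size usable at this point of the range. */
uint64_t model_pick_map_size(uint64_t va, uint64_t pa, uint64_t remaining)
{
	if (((va | pa) & (MODEL_PUD_SIZE - 1)) == 0 &&
	    remaining >= MODEL_PUD_SIZE)
		return MODEL_PUD_SIZE;
	if (((va | pa) & (MODEL_PMD_SIZE - 1)) == 0 &&
	    remaining >= MODEL_PMD_SIZE)
		return MODEL_PMD_SIZE;
	return MODEL_PTE_SIZE;
}

struct model_map_counts {
	uint64_t ptes, pmds, puds;
};

/* Count how many entries of each size a whole range would need. */
struct model_map_counts model_count_entries(uint64_t va, uint64_t pa,
					    uint64_t len)
{
	struct model_map_counts c = { 0, 0, 0 };

	/* Everything must be at least base-page aligned. */
	assert(((va | pa | len) & (MODEL_PTE_SIZE - 1)) == 0);

	while (len) {
		uint64_t sz = model_pick_map_size(va, pa, len);

		if (sz == MODEL_PUD_SIZE)
			c.puds++;
		else if (sz == MODEL_PMD_SIZE)
			c.pmds++;
		else
			c.ptes++;
		va += sz;
		pa += sz;
		len -= sz;
	}
	return c;
}
```

In this model an 8GB range that is 1G aligned on both sides needs 8 entries,
while the same kind of range with a VA that is off by one 4K page never
leaves PTE granularity, which is why VA alignment matters so much here.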

More driver support
-------------------

VFIO is so far the only consumer of huge pfnmaps after this series is
applied.  Besides the generic remap_pfn_range() optimization above, a
device driver can also try to optimize its mmap() for a better VA alignment
for either PMD or PUD sizes.  This may, iiuc, normally require userspace
changes, as the driver doesn't normally decide the VA at which a BAR is
mapped.  But I don't think I know all the drivers well enough to see the
full picture.

Tests Done
==========

- Cross-build tests

- run_vmtests.sh

- Hacked e1000e QEMU with 128MB BAR 0, with some prefault, mprotect() and
  fork() tests on the mapped BAR

- x86_64 + AMD GPU
  - Needs Alex's modified QEMU to guarantee proper VA alignment, to make
    sure all pages are mapped with PUDs
  - Main BAR (8GB) starts to use PUD mappings
  - Sub BAR (??MBs?) starts to use PMD mappings
  - Performance-wise, a slight improvement compared to the old PTE mappings

- aarch64 + NIC
  - Detached NIC test to make sure driver loads fine with PMD mappings

All credit goes to Alex for helping test the GPU/NIC use cases above.

Comments welcomed, thanks.

[0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
[2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com

Alex Williamson (1):
  vfio/pci: Implement huge_fault support

Peter Xu (18):
  mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
  mm: Drop is_huge_zero_pud()
  mm: Mark special bits for huge pfn mappings when inject
  mm: Allow THP orders for PFNMAPs
  mm/gup: Detect huge pfnmap entries in gup-fast
  mm/pagewalk: Check pfnmap for folio_walk_start()
  mm/fork: Accept huge pfnmap entries
  mm: Always define pxx_pgprot()
  mm: New follow_pfnmap API
  KVM: Use follow_pfnmap API
  s390/pci_mmio: Use follow_pfnmap API
  mm/x86/pat: Use the new follow_pfnmap API
  vfio: Use the new follow_pfnmap API
  acrn: Use the new follow_pfnmap API
  mm/access_process_vm: Use the new follow_pfnmap API
  mm: Remove follow_pte()
  mm/x86: Support large pfn mappings
  mm/arm64: Support large pfn mappings

 arch/arm64/Kconfig                  |   1 +
 arch/arm64/include/asm/pgtable.h    |  30 +++++
 arch/powerpc/include/asm/pgtable.h  |   1 +
 arch/s390/include/asm/pgtable.h     |   1 +
 arch/s390/pci/pci_mmio.c            |  22 ++--
 arch/sparc/include/asm/pgtable_64.h |   1 +
 arch/x86/Kconfig                    |   1 +
 arch/x86/include/asm/pgtable.h      |  80 +++++++-----
 arch/x86/mm/pat/memtype.c           |  17 ++-
 drivers/vfio/pci/vfio_pci_core.c    |  60 ++++++---
 drivers/vfio/vfio_iommu_type1.c     |  16 +--
 drivers/virt/acrn/mm.c              |  16 +--
 include/linux/huge_mm.h             |  16 +--
 include/linux/mm.h                  |  57 ++++++++-
 include/linux/pgtable.h             |  12 ++
 mm/Kconfig                          |  13 ++
 mm/gup.c                            |   6 +
 mm/huge_memory.c                    |  50 +++++---
 mm/memory.c                         | 183 ++++++++++++++++++++--------
 mm/pagewalk.c                       |   4 +-
 virt/kvm/kvm_main.c                 |  19 ++-
 21 files changed, 425 insertions(+), 181 deletions(-)

-- 
2.45.0

