Message-ID: <CACw3F50Zi7CQsSOcCutRUy1h5p=7UBw7ZRGm4WayvsnuuEnKow@mail.gmail.com>
Date: Tue, 27 Aug 2024 15:36:07 -0700
From: Jiaqi Yan <jiaqiyan@...gle.com>
To: Peter Xu <peterx@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Gavin Shan <gshan@...hat.com>, Catalin Marinas <catalin.marinas@....com>, x86@...nel.org,
Ingo Molnar <mingo@...hat.com>, Andrew Morton <akpm@...ux-foundation.org>,
Paolo Bonzini <pbonzini@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
Thomas Gleixner <tglx@...utronix.de>, Alistair Popple <apopple@...dia.com>, kvm@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, Sean Christopherson <seanjc@...gle.com>,
Oscar Salvador <osalvador@...e.de>, Jason Gunthorpe <jgg@...dia.com>, Borislav Petkov <bp@...en8.de>,
Zi Yan <ziy@...dia.com>, Axel Rasmussen <axelrasmussen@...gle.com>,
David Hildenbrand <david@...hat.com>, Yan Zhao <yan.y.zhao@...el.com>, Will Deacon <will@...nel.org>,
Kefeng Wang <wangkefeng.wang@...wei.com>, Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v2 00/19] mm: Support huge pfnmaps
On Mon, Aug 26, 2024 at 1:44 PM Peter Xu <peterx@...hat.com> wrote:
>
> v2:
> - Added tags
> - Let folio_walk_start() scan special pmd/pud bits [DavidH]
> - Switch copy_huge_pmd() COW+writable check into a VM_WARN_ON_ONCE()
> - Update commit message to drop mentioning of gup-fast, in patch "mm: Mark
> special bits for huge pfn mappings when inject" [JasonG]
> - In gup-fast, reorder _special check v.s. _devmap check, so as to make
> pmd/pud path look the same as pte path [DavidH, JasonG]
> - Enrich comments for follow_pfnmap*() API, emphasize the risk when PFN is
> used after the end() is invoked, s/-ve/negative/ [JasonG, Sean]
>
> Overview
> ========
>
> This series is based on the latest mm-unstable as of Aug 26th (commit
> b659edec079c), with the patch "vma remove the unneeded avc bound with
> non-CoWed folio" reverted, as it was reported broken [0].
>
> This series implements huge pfnmap support for mm in general. Huge pfnmap
> allows e.g. VM_PFNMAP vmas to be mapped at either the PMD or PUD level,
> similar to what we already do with dax / thp / hugetlb, so as to benefit
> from better TLB hit rates. Now we extend that idea to PFN mappings, e.g.
> PCI MMIO BARs, which can grow as large as 8GB or even bigger.
>
> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
> patch (from Alex Williamson) is the first user of huge pfnmaps, enabling
> the vfio-pci driver to fault in huge pfn mappings.
>
> Implementation
> ==============
>
> In reality, adding such support is relatively simple compared to many
> other types of mappings, because of PFNMAP's specialty of having no
> vmemmap backing it, so most kernel routines on huge mappings already
> simply fail for them, like GUP or the old-school follow_page() (which
> was recently rewritten into the folio_walk* APIs by David).
>
> One trick here is that the generic paths are still immature on PUDs here
> and there, as DAX is so far the only user. This patchset adds the 2nd
> user. Hugetlb could become a 3rd user if the hugetlb unification work
> goes smoothly, but that is to be discussed later.
>
> The other trick is how to make gup-fast work for such huge mappings even
> though there is no direct way of knowing whether it's a normal page or an
> MMIO mapping. This series chose to keep the pte_special solution, reusing
> the same idea by setting a special bit on pfnmap PMDs/PUDs so that
> gup-fast can identify them and fail properly.
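
To check that I read that right, here is a rough sketch of the idea as I
understand it (not the actual patch; pmd_special() is my guess at the
helper name, based on the "special bits to pmd/pud" patch title):

    /*
     * Sketch: in the gup-fast PMD leaf path, a "special" PMD means the
     * mapping has no struct page / vmemmap behind it, so there is
     * nothing to pin.  Bailing out here mirrors what pte_special()
     * already does for PTE-level pfnmaps.
     */
    if (pmd_special(pmd))
            return 0;   /* make gup-fast give up and fail the pin */
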
>
> Along the way, we'll also notice that the major pgtable pfn walker, aka
> follow_pte(), will need to retire soon because it only works with ptes.
> A new set of simple APIs is introduced (the follow_pfnmap* API) that can
> do whatever follow_pte() can already do, and on top of that can also
> process huge pfnmaps. Half of this series is about that and about
> converting all existing pfnmap walkers to use the new API properly.
> Hopefully the new API also looks better by not exposing e.g. pgtable lock
> details to the callers, so it can be used in an even more straightforward way.
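
For readers following along, a minimal sketch of the usage pattern I take
away from the description above (only follow_pfnmap_start()/end() come
from the cover letter; the args struct and its field names are my
assumption):

    struct follow_pfnmap_args args = {
            .vma = vma,                  /* assumed field names */
            .address = addr,
    };

    if (follow_pfnmap_start(&args))
            return -EFAULT;              /* nothing mapped, or not a pfnmap */

    pfn = args.pfn;                      /* use the PFN only while the walk is live */
    /* ... */
    follow_pfnmap_end(&args);            /* after this, the PFN must not be used */
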
>
> Three more config options are introduced here for huge pfnmaps:
>
> - ARCH_SUPPORTS_HUGE_PFNMAP
>
> Arch developers will need to select this option in the arch's Kconfig
> when huge pfnmap is supported. After this patchset is applied, both
> x86_64 and arm64 will enable it by default.
>
> - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
>
> These options let driver developers identify whether the current
> arch / config supports huge pfnmaps, so they can decide whether to use
> the huge pfnmap APIs to inject them. One can refer to the last
> vfio-pci patch from Alex for how to use them properly in a device
> driver.
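
If I read this right, a driver's huge_fault handler would consume these
options roughly like below (a sketch under my own assumptions, not Alex's
actual patch; foo_bar_pfn() is a made-up helper, and
CONFIG_ARCH_SUPPORTS_PMD_PFNMAP is assumed to follow from the option name):

    static vm_fault_t foo_huge_fault(struct vm_fault *vmf, unsigned int order)
    {
            unsigned long pfn = foo_bar_pfn(vmf);   /* hypothetical helper */

            /* Only attempt a PMD-sized insert when the arch/config allows it. */
            if (order == PMD_ORDER &&
                IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP))
                    return vmf_insert_pfn_pmd(vmf, pfn_to_pfn_t(pfn), true);

            /* Otherwise fall back so the core mm retries the fault at order 0. */
            return VM_FAULT_FALLBACK;
    }
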
>
> So after the whole set is applied, and if one enables some dynamic debug
> lines in the vfio-pci core files, we should observe things like:
>
> vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
> vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
> vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
>
> In this specific case, it says that vfio-pci properly faults in PMDs for
> a few BAR 0 offsets.
>
> Patch Layout
> ============
>
> Patch 1: Introduce the new options mentioned above for huge PFNMAPs
> Patch 2: A tiny cleanup
> Patch 3-8: Preparation patches for huge pfnmap (including introducing the
> special bit for pmd/pud)
> Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and
> then drop follow_pte() API
> Patch 17: Add huge pfnmap support for x86_64
> Patch 18: Add huge pfnmap support for arm64
> Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex)
>
> TODO
> ====
>
> More architectures / More page sizes
> ------------------------------------
>
> Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems
> to be a plan to support arm64 1G later on top of this series [2].
>
> Any arch will need to first support THP / THP_1G, then provide a special
> bit in pmds/puds to support huge pfnmaps.
>
> remap_pfn_range() support
> -------------------------
>
> Currently, remap_pfn_range() still only maps PTEs. With the new options,
> remap_pfn_range() could logically start to inject either PMDs or PUDs
> when the alignment requirements are met on the VAs.
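
To illustrate what that would mean for drivers (a sketch under my own
assumptions, not code from this series; foo_bar_phys() is a made-up
helper): the driver side would stay unchanged, and a future
remap_pfn_range() could internally install PMD/PUD entries whenever the
VA, the physical address and the size happen to be suitably aligned.

    static int foo_mmap(struct file *file, struct vm_area_struct *vma)
    {
            unsigned long size = vma->vm_end - vma->vm_start;

            /* Today this installs PTEs only; with the future support it
             * could transparently use PMD/PUD mappings when aligned. */
            return remap_pfn_range(vma, vma->vm_start,
                                   foo_bar_phys(file) >> PAGE_SHIFT,
                                   size, vma->vm_page_prot);
    }
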
>
> When the support is there, it should silently benefit all drivers that
> use remap_pfn_range() in their mmap() handlers, with better TLB hit rates
> and overall faster MMIO accesses, similar to what hugepages already give
> the processor on normal memory.
>
Hi Peter,
I am curious whether any work is needed for unmap_mapping_range. If a
driver remap_pfn_range()ed at 1G granularity, can it later unmap at
PAGE_SIZE granularity? For example, when a PFN inside the 1G mapping is
poisoned, it would be great if the mapping could be split into 2M
mappings + 4K mappings, so that only the single poisoned PFN is lost.
(Pretty much like the past proposal* to use HGM** to improve hugetlb's
memory failure handling.)
Probably these questions can be answered after reading your code, which
I plan to do, but I just wanted to ask in case you have an easy answer
for me.
* https://patchwork.plctlab.org/project/linux-kernel/cover/20230428004139.2899856-1-jiaqiyan@google.com/
** https://lwn.net/Articles/912017
> More driver support
> -------------------
>
> VFIO is so far the only consumer of huge pfnmaps after this series is
> applied. Besides the generic remap_pfn_range() optimization above, a
> device driver can also try to optimize its mmap() for better VA alignment
> at PMD/PUD sizes. This would, IIUC, normally require userspace changes,
> as the driver doesn't normally decide the VA at which a BAR is mapped.
> But I don't think I know all the drivers well enough to see the full picture.
>
> Tests Done
> ==========
>
> - Cross-build tests
>
> - run_vmtests.sh
>
> - Hacked e1000e in QEMU with a 128MB BAR 0, with some prefault tests and
>   mprotect() and fork() tests on the mapped BAR
>
> - x86_64 + AMD GPU
>   - Needs Alex's modified QEMU to guarantee proper VA alignment, to make
>     sure all pages get mapped with PUDs
>   - Main BAR (8GB) starts to use PUD mappings
>   - Sub BAR (??MBs?) starts to use PMD mappings
>   - Performance-wise, slight improvement compared to the old PTE mappings
>
> - aarch64 + NIC
>   - Detached NIC test to make sure the driver loads fine with PMD mappings
>
> All credit goes to Alex for helping test the GPU/NIC use cases above.
>
> Comments welcome, thanks.
>
> [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
> [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
> [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
>
> Alex Williamson (1):
> vfio/pci: Implement huge_fault support
>
> Peter Xu (18):
> mm: Introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud
> mm: Drop is_huge_zero_pud()
> mm: Mark special bits for huge pfn mappings when inject
> mm: Allow THP orders for PFNMAPs
> mm/gup: Detect huge pfnmap entries in gup-fast
> mm/pagewalk: Check pfnmap for folio_walk_start()
> mm/fork: Accept huge pfnmap entries
> mm: Always define pxx_pgprot()
> mm: New follow_pfnmap API
> KVM: Use follow_pfnmap API
> s390/pci_mmio: Use follow_pfnmap API
> mm/x86/pat: Use the new follow_pfnmap API
> vfio: Use the new follow_pfnmap API
> acrn: Use the new follow_pfnmap API
> mm/access_process_vm: Use the new follow_pfnmap API
> mm: Remove follow_pte()
> mm/x86: Support large pfn mappings
> mm/arm64: Support large pfn mappings
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable.h | 30 +++++
> arch/powerpc/include/asm/pgtable.h | 1 +
> arch/s390/include/asm/pgtable.h | 1 +
> arch/s390/pci/pci_mmio.c | 22 ++--
> arch/sparc/include/asm/pgtable_64.h | 1 +
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 80 +++++++-----
> arch/x86/mm/pat/memtype.c | 17 ++-
> drivers/vfio/pci/vfio_pci_core.c | 60 ++++++---
> drivers/vfio/vfio_iommu_type1.c | 16 +--
> drivers/virt/acrn/mm.c | 16 +--
> include/linux/huge_mm.h | 16 +--
> include/linux/mm.h | 57 ++++++++-
> include/linux/pgtable.h | 12 ++
> mm/Kconfig | 13 ++
> mm/gup.c | 6 +
> mm/huge_memory.c | 50 +++++---
> mm/memory.c | 183 ++++++++++++++++++++--------
> mm/pagewalk.c | 4 +-
> virt/kvm/kvm_main.c | 19 ++-
> 21 files changed, 425 insertions(+), 181 deletions(-)
>
> --
> 2.45.0
>
>