Message-ID: <Zr0ZbPQHVNzmvwa6@google.com>
Date: Wed, 14 Aug 2024 13:54:04 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Peter Xu <peterx@...hat.com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Oscar Salvador <osalvador@...e.de>, Axel Rasmussen <axelrasmussen@...gle.com>, 
	linux-arm-kernel@...ts.infradead.org, x86@...nel.org, 
	Will Deacon <will@...nel.org>, Gavin Shan <gshan@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>, 
	Zi Yan <ziy@...dia.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Catalin Marinas <catalin.marinas@....com>, Ingo Molnar <mingo@...hat.com>, 
	Alistair Popple <apopple@...dia.com>, Borislav Petkov <bp@...en8.de>, David Hildenbrand <david@...hat.com>, 
	Thomas Gleixner <tglx@...utronix.de>, kvm@...r.kernel.org, 
	Dave Hansen <dave.hansen@...ux.intel.com>, Alex Williamson <alex.williamson@...hat.com>, 
	Yan Zhao <yan.y.zhao@...el.com>, Oliver Upton <oliver.upton@...ux.dev>, 
	Marc Zyngier <maz@...nel.org>
Subject: Re: [PATCH 00/19] mm: Support huge pfnmaps

+Marc and Oliver

On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > Overview
> > > > ========
> > > > 
> > > > This series is based on mm-unstable, commit 98808d08fc0f of Aug 7th latest,
> > > > plus dax 1g fix [1].  Note that this series should also apply without the
> > > > dax 1g fix series, but in that case mprotect() will trigger similar errors
> > > > on PUD mappings.
> > > > 
> > > > This series implements huge pfnmap support for mm in general.  Huge pfnmaps
> > > > allow e.g. VM_PFNMAP vmas to be mapped at PMD or PUD level, similar to what
> > > > we do with dax / thp / hugetlb so far, to benefit from TLB hits.  Now we
> > > > extend that idea to PFN mappings, e.g. PCI MMIO bars, which can be as large
> > > > as 8GB or even bigger.
> > > 
> > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > context with VMs.
> > > 
> > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > set up the IOMMU, but KVM is not able to do it so reliably.
> > 
> > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > a huge page unless the mapping is huge in the primary MMU.  And that's very much
> > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > thus can't determine whether or not it's safe to create a huge page if KVM happens
> > to realize the VM has access to a contiguous range of memory.
> 
> Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> ranges?

Nope.  KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
a win in very select use cases, and is overall a non-trivial loss.  
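
For reference, a rough sketch of what get_vma_page_shift() does today (paraphrased
from memory and simplified, not verbatim kernel code): for VM_PFNMAP VMAs it only
selects a block mapping when the hva and the pa it maps to are congruent modulo the
block size and the entire block fits inside the VMA, i.e. contiguity is inferred
purely from the VMA layout.

static int get_vma_page_shift_sketch(struct vm_area_struct *vma, unsigned long hva)
{
        unsigned long pa;

        if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
                return huge_page_shift(hstate_vma(vma));

        if (!(vma->vm_flags & VM_PFNMAP))
                return PAGE_SHIFT;

        /* Physical address backing @hva, derived purely from the VMA. */
        pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);

        /*
         * Use a block mapping only if @hva and @pa are congruent modulo the
         * block size and the entire block lies within the VMA (ignoring the
         * __PAGETABLE_PMD_FOLDED case for brevity).
         */
        if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
            ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
            ALIGN(hva, PUD_SIZE) <= vma->vm_end)
                return PUD_SHIFT;

        if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
            ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
            ALIGN(hva, PMD_SIZE) <= vma->vm_end)
                return PMD_SHIFT;

        return PAGE_SHIFT;
}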

> I don't quite understand your safety argument, if the VMA has 1G of
> contiguous physical memory described with 4K it is definitely safe for
> KVM to reassemble that same memory and represent it as 1G.

That would require taking mmap_lock to get the VMA, which would be a net negative,
especially for latency-sensitive workloads.  E.g. if userspace is doing
mprotect(), madvise(), etc. on a VMA that is NOT mapped into the guest, taking
mmap_lock in the guest page fault path means vCPUs block waiting for the unrelated
host operation to complete.  And vice versa, subsequent host operations can be
blocked waiting on vCPUs.
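
To make the locking interaction concrete, a simplified sketch (not actual kernel
code) of why a per-mm rwsem couples the two paths even when they touch completely
different VMAs:

/*
 * Simplified sketch, not actual kernel code: the stage-2 fault path only
 * needs the VMA briefly, but mmap_lock is a per-mm rwsem, so the fault path
 * still serializes against any host operation that takes the lock for write,
 * even when the two touch completely different VMAs.
 */
static int fault_path_sketch(struct mm_struct *mm, unsigned long hva)
{
        struct vm_area_struct *vma;

        mmap_read_lock(mm);             /* blocks if a writer is pending */
        vma = vma_lookup(mm, hva);
        /* ... derive mapping size, MTE, cacheability, etc. from the VMA ... */
        mmap_read_unlock(mm);

        return vma ? 0 : -EFAULT;
}

/* Meanwhile, mprotect()/madvise() on some VMA not mapped into the guest: */
static void host_op_sketch(struct mm_struct *mm)
{
        mmap_write_lock(mm);            /* waits for all vCPU readers to drain */
        /* ... modify the unrelated VMA ... */
        mmap_write_unlock(mm);
}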

Which reminds me...

Marc/Oliver,

TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
on arm64, specifically the mprotect() testcase[1], as performance is significantly
worse compared to x86, and there might be bugs lurking in the mmu_notifier flows.

When running mmu_stress_test the mprotect() phase that makes guest memory read-only
takes more than three times as long on arm64 versus x86.  The time to initially
populate memory (run1) is also notably higher on arm64, as is the time to
mprotect() back to RW protections.

The test doesn't go super far out of its way to control the environment, but it
should be a fairly reasonable apples-to-apples comparison.  

Ouch.  I take that back, it's not apples-to-apples, because the test does more
work for x86.  On x86, during mprotect(PROT_READ), the userspace side skips the
faulting instruction on -EFAULT and so vCPUs keep writing for the entire duration.
Other architectures stop running the vCPU after the first write -EFAULT and wait
for the mprotect() to complete.  If I comment out the x86-only logic and have
vCPUs stop on the first -EFAULT, the mprotect() time goes way down.
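
I.e. the relevant part of the vCPU worker behaves something like this (illustrative
sketch; skip_on_fault, skip_insn() and wait_until_writable() are made-up names, not
the actual selftest code):

        /* Illustrative sketch; skip_on_fault, skip_insn() and
         * wait_until_writable() are made-up names, not the actual test code. */
        for (;;) {
                int r = _vcpu_run(vcpu);

                if (!r)
                        continue;
                TEST_ASSERT(errno == EFAULT, "Expected -EFAULT from the write");

                if (skip_on_fault)              /* x86-only today */
                        skip_insn(vcpu);        /* keep writing during mprotect() */
                else
                        wait_until_writable();  /* park until mprotect() back to RW */
        }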

/me fiddles with arm64

And if I have arm64 vCPUs keep faulting, the time goes up, as expected.

With 128GiB of guest memory (aliased to a single 2GiB chunk of physical memory),
and 48 vCPUs (on systems with 64+ CPUs), stopping on the first fault:

 x86:
  run1 =  6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro =  6.149083106s, rw = 7.713627355s

 arm64:
  run1 = 13.960144969s, reset = 0.000178596s, run2 = 0.018020005s, ro = 50.924434051s, rw = 14.712983786s

and skipping on -EFAULT and thus writing throughout mprotect():

 x86:
  run1 =  6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s

 arm64:
  run1 = 13.543469513s, reset = 0.000018763s, run2 = 0.020533896s, ro = 81.063504438s, rw = 14.967504024s

I originally suspected that the main source of the difference is that user_mem_abort()
takes mmap_lock for read.  But I doubt that's the case now that I realize arm64
vCPUs stop after the first -EFAULT, i.e. won't contend mmap_lock.

And it shouldn't be the lack of support for mmu_invalidate_retry_gfn() (on arm64
vs. x86), because the mprotect() targets guest memory (though range-based
retry is something that KVM ARM likely should support).  And again, arm64 is still
much slower when vCPUs stop on the first -EFAULT.
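
For the non-KVM folks, "range-based retry" means the fault path only bails if the
in-progress invalidation actually overlaps the faulting gfn, roughly (paraphrasing
mmu_invalidate_retry_gfn() from include/linux/kvm_host.h, not verbatim):

        /*
         * Paraphrased from mmu_invalidate_retry_gfn(), not verbatim.  Compare
         * with mmu_invalidate_retry(), which triggers a retry on *any*
         * in-progress invalidation.
         */
        if (unlikely(kvm->mmu_invalidate_in_progress) &&
            gfn >= kvm->mmu_invalidate_range_start &&
            gfn < kvm->mmu_invalidate_range_end)
                return 1;       /* the invalidation overlaps this fault, retry */

        if (kvm->mmu_invalidate_seq != mmu_seq)
                return 1;       /* an invalidation completed after the seq was snapshotted */

        return 0;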

However, before I realized mmap_lock likely wasn't the main problem, I tried to
prove that taking mmap_lock is problematic, and that didn't end well.  When I
hacked user_mem_abort() to not take mmap_lock, the host reboots when the mprotect()
read-only phase kicks in.  AFAICT, there is no crash, e.g. no kdump and nothing
printed to the console; the host suddenly just starts executing firmware code.

To try and rule out a hidden dependency I'm missing, I tried trimming out pretty
much everything that runs under mmap_lock (and only running the selftest, which
doesn't do anything "odd", e.g. doesn't passthrough device memory).  I also forced
small pages, e.g. in case transparent_hugepage_adjust() is somehow reliant on
mmap_lock being taken, to no avail.  Just in case someone can spot an obvious
flaw in my hack, the final diff I tried is below.

Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
fault path for valid VMAs does introduce a performance regression, but only ~30%,
not the ~6x jump from x86 to arm64.  So that too makes it unlikely taking mmap_lock
is the main problem, though it's still good justification for avoiding mmap_lock in
the page fault path.
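
For context, the x86 experiment was nothing more exotic than something along these
lines in the fault path (rough reconstruction with assumed placement/field names,
not the exact hack):

        /*
         * Rough reconstruction, with assumed placement in kvm_faultin_pfn() and
         * an assumed fault->hva field; not the exact hack.
         */
        struct vm_area_struct *vma;

        mmap_read_lock(current->mm);
        vma = vma_lookup(current->mm, fault->hva);
        mmap_read_unlock(current->mm);
        if (!vma)
                return -EFAULT;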

[1] https://lore.kernel.org/all/20240809194335.1726916-9-seanjc@google.com

---
 arch/arm64/kvm/mmu.c | 167 +++----------------------------------------
 1 file changed, 9 insertions(+), 158 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..df551c19f626 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1424,14 +1424,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  bool fault_is_perm)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
-	bool exec_fault, mte_allowed;
-	bool device = false, vfio_allow_any_uc = false;
+	bool write_fault, writable;
+	bool exec_fault;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
-	struct vm_area_struct *vma;
 	short vma_shift;
 	gfn_t gfn;
 	kvm_pfn_t pfn;
@@ -1451,6 +1449,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
+	if (WARN_ON_ONCE(nested) || WARN_ON_ONCE(kvm_has_mte(kvm)))
+		return -EIO;
+
 	/*
 	 * Permission faults just need to update the existing leaf entry,
 	 * and so normally don't require allocations from the memcache. The
@@ -1464,92 +1465,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			return ret;
 	}
 
-	/*
-	 * Let's check if we will get back a huge page backed by hugetlbfs, or
-	 * get block mapping for device MMIO region.
-	 */
-	mmap_read_lock(current->mm);
-	vma = vma_lookup(current->mm, hva);
-	if (unlikely(!vma)) {
-		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
-		mmap_read_unlock(current->mm);
-		return -EFAULT;
-	}
-
-	/*
-	 * logging_active is guaranteed to never be true for VM_PFNMAP
-	 * memslots.
-	 */
-	if (logging_active) {
-		force_pte = true;
-		vma_shift = PAGE_SHIFT;
-	} else {
-		vma_shift = get_vma_page_shift(vma, hva);
-	}
-
-	switch (vma_shift) {
-#ifndef __PAGETABLE_PMD_FOLDED
-	case PUD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
-			break;
-		fallthrough;
-#endif
-	case CONT_PMD_SHIFT:
-		vma_shift = PMD_SHIFT;
-		fallthrough;
-	case PMD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
-			break;
-		fallthrough;
-	case CONT_PTE_SHIFT:
-		vma_shift = PAGE_SHIFT;
-		force_pte = true;
-		fallthrough;
-	case PAGE_SHIFT:
-		break;
-	default:
-		WARN_ONCE(1, "Unknown vma_shift %d", vma_shift);
-	}
-
+	vma_shift = PAGE_SHIFT;
 	vma_pagesize = 1UL << vma_shift;
-
-	if (nested) {
-		unsigned long max_map_size;
-
-		max_map_size = force_pte ? PAGE_SIZE : PUD_SIZE;
-
-		ipa = kvm_s2_trans_output(nested);
-
-		/*
-		 * If we're about to create a shadow stage 2 entry, then we
-		 * can only create a block mapping if the guest stage 2 page
-		 * table uses at least as big a mapping.
-		 */
-		max_map_size = min(kvm_s2_trans_size(nested), max_map_size);
-
-		/*
-		 * Be careful that if the mapping size falls between
-		 * two host sizes, take the smallest of the two.
-		 */
-		if (max_map_size >= PMD_SIZE && max_map_size < PUD_SIZE)
-			max_map_size = PMD_SIZE;
-		else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE)
-			max_map_size = PAGE_SIZE;
-
-		force_pte = (max_map_size == PAGE_SIZE);
-		vma_pagesize = min(vma_pagesize, (long)max_map_size);
-	}
-
-	if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
-		fault_ipa &= ~(vma_pagesize - 1);
-
 	gfn = ipa >> PAGE_SHIFT;
-	mte_allowed = kvm_vma_mte_allowed(vma);
-
-	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
-
-	/* Don't use the VMA after the unlock -- it may have vanished */
-	vma = NULL;
 
 	/*
 	 * Read mmu_invalidate_seq so that KVM can detect if the results of
@@ -1560,7 +1478,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * with the smp_wmb() in kvm_mmu_invalidate_end().
 	 */
 	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
-	mmap_read_unlock(current->mm);
+	smp_rmb();
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
 				   write_fault, &writable, NULL);
@@ -1571,19 +1489,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
-	if (kvm_is_device_pfn(pfn)) {
-		/*
-		 * If the page was identified as device early by looking at
-		 * the VMA flags, vma_pagesize is already representing the
-		 * largest quantity we can map.  If instead it was mapped
-		 * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
-		 * and must not be upgraded.
-		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
-		 */
-		device = true;
-	} else if (logging_active && !write_fault) {
+	if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
 		 * fault.
@@ -1591,28 +1497,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		writable = false;
 	}
 
-	if (exec_fault && device)
-		return -ENOEXEC;
-
-	/*
-	 * Potentially reduce shadow S2 permissions to match the guest's own
-	 * S2. For exec faults, we'd only reach this point if the guest
-	 * actually allowed it (see kvm_s2_handle_perm_fault).
-	 *
-	 * Also encode the level of the original translation in the SW bits
-	 * of the leaf entry as a proxy for the span of that translation.
-	 * This will be retrieved on TLB invalidation from the guest and
-	 * used to limit the invalidation scope if a TTL hint or a range
-	 * isn't provided.
-	 */
-	if (nested) {
-		writable &= kvm_s2_trans_writable(nested);
-		if (!kvm_s2_trans_readable(nested))
-			prot &= ~KVM_PGTABLE_PROT_R;
-
-		prot |= kvm_encode_nested_level(nested);
-	}
-
 	read_lock(&kvm->mmu_lock);
 	pgt = vcpu->arch.hw_mmu->pgt;
 	if (mmu_invalidate_retry(kvm, mmu_seq)) {
@@ -1620,46 +1504,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		goto out_unlock;
 	}
 
-	/*
-	 * If we are not forced to use page mapping, check if we are
-	 * backed by a THP and thus use block mapping if possible.
-	 */
-	if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
-		if (fault_is_perm && fault_granule > PAGE_SIZE)
-			vma_pagesize = fault_granule;
-		else
-			vma_pagesize = transparent_hugepage_adjust(kvm, memslot,
-								   hva, &pfn,
-								   &fault_ipa);
-
-		if (vma_pagesize < 0) {
-			ret = vma_pagesize;
-			goto out_unlock;
-		}
-	}
-
-	if (!fault_is_perm && !device && kvm_has_mte(kvm)) {
-		/* Check the VMM hasn't introduced a new disallowed VMA */
-		if (mte_allowed) {
-			sanitise_mte_tags(kvm, pfn, vma_pagesize);
-		} else {
-			ret = -EFAULT;
-			goto out_unlock;
-		}
-	}
-
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
 	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (device) {
-		if (vfio_allow_any_uc)
-			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
-		else
-			prot |= KVM_PGTABLE_PROT_DEVICE;
-	} else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
+	if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
 		   (!nested || kvm_s2_trans_executable(nested))) {
 		prot |= KVM_PGTABLE_PROT_X;
 	}

base-commit: 15e1c3d65975524c5c792fcd59f7d89f00402261
-- 
