[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zr1KiHO0pxPdYD_U@google.com>
Date: Wed, 14 Aug 2024 17:23:36 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Oliver Upton <oliver.upton@...ux.dev>
Cc: Jason Gunthorpe <jgg@...dia.com>, Peter Xu <peterx@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Oscar Salvador <osalvador@...e.de>,
Axel Rasmussen <axelrasmussen@...gle.com>, linux-arm-kernel@...ts.infradead.org,
x86@...nel.org, Will Deacon <will@...nel.org>, Gavin Shan <gshan@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>, Zi Yan <ziy@...dia.com>,
Andrew Morton <akpm@...ux-foundation.org>, Catalin Marinas <catalin.marinas@....com>,
Ingo Molnar <mingo@...hat.com>, Alistair Popple <apopple@...dia.com>, Borislav Petkov <bp@...en8.de>,
David Hildenbrand <david@...hat.com>, Thomas Gleixner <tglx@...utronix.de>, kvm@...r.kernel.org,
Dave Hansen <dave.hansen@...ux.intel.com>, Alex Williamson <alex.williamson@...hat.com>,
Yan Zhao <yan.y.zhao@...el.com>, Marc Zyngier <maz@...nel.org>
Subject: Re: [PATCH 00/19] mm: Support huge pfnmaps
On Wed, Aug 14, 2024, Oliver Upton wrote:
> On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote:
> > On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> > > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> > > on arm64, specifically the mprotect() testcase[1], as performance is significantly
> > > worse compared to x86,
> >
> > Sharing what we discussed offline:
> >
> > Sean was using a machine w/o FEAT_FWB for this test, so the increased
> > runtime on arm64 is likely explained by the CMOs we're doing when
> > creating or invalidating a stage-2 PTE.
> >
> > Using a machine w/ FEAT_FWB would be better for making these sort of
> > cross-architecture comparisons. Beyond CMOs, we do have some
>
> ... some heavy barriers (e.g. DSB(ishst)) we use to ensure page table
> updates are visible to the system. So there could still be some
> arch-specific quirks that'll show up in the test.
Nope, 'twas FWB. On a system with FWB, ARM nicely outperforms x86 on mprotect()
when vCPUs stop on the first -EFAULT. I suspect because ARM can do broadcast TLB
invalidations and doesn't need to interrupt and wait for every vCPU to respond.
run1 = 10.723194154s, reset = 0.000014732s, run2 = 0.013790876s, ro = 2.151261587s, rw = 10.624272116s
However, having vCPUs continue faulting while mprotect() is running turns the
tables, I suspect due to mmap_lock
run1 = 10.768003815s, reset = 0.000012051s, run2 = 0.013781921s, ro = 23.277624455s, rw = 10.649136889s
The x86 numbers since they're out of sight now:
-EFAULT once
run1 = 6.873408794s, reset = 0.000165898s, run2 = 0.035537803s, ro = 6.149083106s, rw = 7.713627355s
-EFAULT forever
run1 = 6.923218747s, reset = 0.000167050s, run2 = 0.034676225s, ro = 14.599445790s, rw = 7.763152792s
> > > and there might be bugs lurking the mmu_notifier flows.
> >
> > Impossible! :)
> >
> > > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> > > fault path for valid VMAs does introduce a performance regression, but only ~30%,
> > > not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock
> > > is the main problem, though it's still good justification for avoid mmap_lock in
> > > the page fault path.
> >
> > I'm curious how much of that 30% in a microbenchmark would translate to
> > real world performance, since it isn't *that* egregious.
vCPU jitter is the big problem, especially if userspace is doing something odd,
and/or if the kernel is preemptible (which also triggers yeild-on-contention logic
for spinlocks, ew). E.g. the range-based retry to avoid spinning and waiting on
an unrelated MM operation was added by the ChromeOS folks[1] to resolve issues
where an MM operation got preempted and so blocked vCPU faults.
But even for cloud setups with a non-preemptible kernel, contending with unrelated
userspace VMM modification can be problematic, e.g. it turns out even the
gfn_to_pfn_cache logic needs range-based retry[2] (though that's a rather
pathological case where userspace is spamming madvise() to the point where vCPUs
can't even make forward progress).
> > We also have other uses for getting at the VMA beyond mapping granularity
> > (MTE and the VFIO Normal-NC hint) that'd require some attention too.
Yeah, though it seems like it'd be easy enough to take mmap_lock if and only if
it's necessary, e.g. similar to how common KVM takes it only if it encounters
VM_PFNMAP'd memory.
E.g. take mmap_lock if and only if MTE is active (I assume that's uncommon?), or
if the fault is to device memory.
[1] https://lore.kernel.org/all/20210222024522.1751719-1-stevensd@google.com
[2] https://lore.kernel.org/all/f862cefff2ed3f4211b69d785670f41667703cf3.camel@infradead.org
Powered by blists - more mailing lists