Message-ID: <20240814221031.GA2032816@nvidia.com>
Date: Wed, 14 Aug 2024 19:10:31 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Peter Xu <peterx@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Oscar Salvador <osalvador@...e.de>,
Axel Rasmussen <axelrasmussen@...gle.com>,
linux-arm-kernel@...ts.infradead.org, x86@...nel.org,
Will Deacon <will@...nel.org>, Gavin Shan <gshan@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>, Zi Yan <ziy@...dia.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Catalin Marinas <catalin.marinas@....com>,
Ingo Molnar <mingo@...hat.com>,
Alistair Popple <apopple@...dia.com>,
Borislav Petkov <bp@...en8.de>,
David Hildenbrand <david@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, kvm@...r.kernel.org,
Dave Hansen <dave.hansen@...ux.intel.com>,
Alex Williamson <alex.williamson@...hat.com>,
Yan Zhao <yan.y.zhao@...el.com>,
Oliver Upton <oliver.upton@...ux.dev>,
Marc Zyngier <maz@...nel.org>
Subject: Re: [PATCH 00/19] mm: Support huge pfnmaps

On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> +Marc and Oliver
>
> On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2024 at 07:35:01AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 14, 2024, Jason Gunthorpe wrote:
> > > > On Fri, Aug 09, 2024 at 12:08:50PM -0400, Peter Xu wrote:
> > > > > Overview
> > > > > ========
> > > > >
> > > > > This series is based on mm-unstable, commit 98808d08fc0f (the latest as
> > > > > of Aug 7th), plus the dax 1g fix [1]. Note that this series should also
> > > > > apply without the dax 1g fix series, but in that case mprotect() will
> > > > > trigger similar errors on PUD mappings.
> > > > >
> > > > > This series implements huge pfnmap support for mm in general. Huge pfnmaps
> > > > > allow e.g. VM_PFNMAP vmas to be mapped at either PMD or PUD level, similar
> > > > > to what we already do with dax / thp / hugetlb, to benefit from TLB hits. Now
> > > > > we extend that idea to PFN mappings, e.g. PCI MMIO bars, which can be as
> > > > > large as 8GB or even bigger.
> > > >
> > > > FWIW, I've started to hear people talk about needing this in the VFIO
> > > > context with VMs.
> > > >
> > > > vfio/iommufd will reassemble the contiguous range from the 4k PFNs to
> > > > set up the IOMMU, but KVM is not able to do it so reliably.
> > >
> > > Heh, KVM should very reliably do the exact opposite, i.e. KVM should never create
> > > a huge page unless the mapping is huge in the primary MMU. And that's very much
> > > by design, as KVM has no knowledge of what actually resides at a given PFN, and
> > > thus can't determine whether or not it's safe to create a huge page if KVM happens
> > > to realize the VM has access to a contiguous range of memory.
> >
> > Oh? Someone told me recently x86 kvm had code to reassemble contiguous
> > ranges?
>
> Nope. KVM ARM does (see get_vma_page_shift()) but I strongly suspect that's only
> a win in very select use cases, and is overall a non-trivial loss.

Ah, that ARM behavior was probably what was being mentioned then! So
take my original remark as applying to this :)
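
For anyone skimming the thread, the shape of that arm64 helper, as I
understand it, is roughly the below. This is a paraphrased sketch from
memory, not the exact upstream code (hugetlb handling omitted): only go
bigger when the PA and the hva are congruent modulo the block size and
the whole block sits inside the vma.

#include <linux/mm.h>
#include <linux/pgtable.h>

static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
{
	unsigned long pa;

	if (!(vma->vm_flags & VM_PFNMAP))
		return PAGE_SHIFT;

	/* Physical address backing @hva, derived from the pfnmap's vm_pgoff */
	pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);

	/* PUD block: same offset within the block, block fully inside the vma */
	if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
	    ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
	    ALIGN(hva, PUD_SIZE) <= vma->vm_end)
		return PUD_SHIFT;

	/* Otherwise try a PMD block the same way */
	if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
	    ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
	    ALIGN(hva, PMD_SIZE) <= vma->vm_end)
		return PMD_SHIFT;

	return PAGE_SHIFT;
}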

> > I don't quite understand your safety argument. If the VMA has 1G of
> > contiguous physical memory described with 4K PTEs, it is definitely safe
> > for KVM to reassemble that same memory and represent it as 1G.
>
> That would require taking mmap_lock to get the VMA, which would be a net negative,
> especially for workloads that are latency sensitive.

You can aggregate if the read and aggregation logic are protected by
mmu notifiers, I think. An invalidation would still have enough
information to clear the aggregate shadow entry. If you get a sequence
number collision then you'd throw away the aggregation.

But yes, I also think it would be slow to have aggregation logic in
KVM. Doing it in the main mmu is much better.
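
Something like the usual notifier sequence pattern is what I have in
mind; a minimal sketch below. The kvm_agg_*() helpers and the exact
locking are made up for illustration (mmu_invalidate_seq /
mmu_invalidate_retry() are the real KVM sequence primitives, the rest
is hypothetical):

#include <linux/kvm_host.h>

/* Hypothetical: try to install a PMD-sized block over 4k pfnmap PTEs */
static bool try_aggregate_block(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
	unsigned long mmu_seq;

	/* Snapshot the invalidation sequence before walking the PTEs */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Illustrative helper: check the 4k pfns are physically contiguous */
	if (!kvm_agg_range_is_contiguous(kvm, gfn, pfn, PMD_SIZE))
		return false;

	write_lock(&kvm->mmu_lock);
	/* Sequence collision: an invalidation raced us, throw it away */
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		write_unlock(&kvm->mmu_lock);
		return false;
	}
	/* Illustrative helper: install the aggregate shadow entry */
	kvm_agg_install_block(kvm, gfn, pfn, PMD_SIZE);
	write_unlock(&kvm->mmu_lock);

	return true;
}

Any invalidation touching the range bumps the sequence, so a stale
aggregate never gets installed, and the invalidation side only needs
the (block-aligned) range to clear the aggregate entry.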

Jason