Message-ID: <CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com>
Date: Tue, 24 Jun 2025 07:10:38 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: Alexey Kardashevskiy <aik@....com>, Fuad Tabba <tabba@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>, 
	kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	x86@...nel.org, linux-fsdevel@...r.kernel.org, ajones@...tanamicro.com, 
	akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com, 
	anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com, 
	binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com, 
	chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com, 
	david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk, 
	erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, 
	haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com, 
	jarkko@...nel.org, jgowans@...zon.com, jhubbard@...dia.com, jroedel@...e.de, 
	jthoughton@...gle.com, jun.miao@...el.com, kai.huang@...el.com, 
	keirf@...gle.com, kent.overstreet@...ux.dev, kirill.shutemov@...el.com, 
	liam.merwick@...cle.com, maciej.wieczor-retman@...el.com, 
	mail@...iej.szmigiero.name, maz@...nel.org, mic@...ikod.net, 
	michael.roth@....com, mpe@...erman.id.au, muchun.song@...ux.dev, 
	nikunj@....com, nsaenz@...zon.es, oliver.upton@...ux.dev, palmer@...belt.com, 
	pankaj.gupta@....com, paul.walmsley@...ive.com, pbonzini@...hat.com, 
	pdurrant@...zon.co.uk, peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, 
	qperret@...gle.com, quic_cvanscha@...cinc.com, quic_eberman@...cinc.com, 
	quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com, 
	quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com, 
	rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk, 
	rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com, 
	steven.sistare@...cle.com, suzuki.poulose@....com, thomas.lendacky@....com, 
	usama.arif@...edance.com, vbabka@...e.cz, viro@...iv.linux.org.uk, 
	vkuznets@...hat.com, wei.w.wang@...el.com, will@...nel.org, 
	willy@...radead.org, xiaoyao.li@...el.com, yan.y.zhao@...el.com, 
	yilun.xu@...el.com, yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
 KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls

On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@...pe.ca> wrote:
>
> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>
> > Now, I am rebasing my RFC on top of this patchset and it fails in
> > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > folios in my RFC.
> >
> > So what is the expected sequence here? The userspace unmaps a DMA
> > page and maps it back right away, all from the userspace? The end
> > result will be the exactly same which seems useless. And IOMMU TLB

As Jason described, ideally the IOMMU, just like KVM, should:
1) Directly rely on guest_memfd for pinning -> no page refcounts taken
by the IOMMU stack
2) Directly query pfns from guest_memfd for both shared and private ranges
3) Implement an invalidation callback that guest_memfd can invoke on
conversions (rough sketch below).
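
To illustrate 3), here is a very rough sketch of the kind of interface
guest_memfd could expose; every name below is made up for illustration
and is not from the patchset:

struct gmem_invalidate_ops {
        /*
         * Called before shareability of [start, end) changes; the
         * consumer (IOMMU, KVM, ...) must drop its mappings for the
         * range and must not install new ones until invalidate_end().
         */
        void (*invalidate_begin)(void *priv, pgoff_t start, pgoff_t end);
        void (*invalidate_end)(void *priv, pgoff_t start, pgoff_t end);
};

/*
 * The consumer registers itself against a guest_memfd instead of
 * taking folio refcounts; pfn lookups then go through gmem directly.
 */
int gmem_register_invalidate_ops(struct file *gmem_file,
                                 const struct gmem_invalidate_ops *ops,
                                 void *priv);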

Current flow (paraphrased as pseudo-C after the two lists below):
Private to Shared conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
         -> KVM has the concept of invalidation_begin() and end(),
which effectively ensures that between these function calls, no new
EPT/NPT entries can be added for the range.
    2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
actually unmaps the KVM SEPT/NPT entries.
    3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
shareability and then splits the folios if needed.

Shared to private conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
    2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
actually unmaps the host mappings, which in turn unmaps the KVM
non-secure EPT/NPT entries.
    3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
shareability and then merges the folios if needed.
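
To make the ordering concrete, both flows boil down to roughly the
following (heavily simplified pseudo-C, not the actual patch code:
invalidate_begin()/end() stand in for kvm_gmem_invalidate_begin()/end(),
unmap_for_conversion() for kvm_gmem_convert_should_proceed(), and
update_shareability() for kvm_gmem_execute_work(); memslot iteration,
error handling and locking are elided):

static int convert_range(struct gmem_instance *gmem, pgoff_t start,
                         pgoff_t end, bool to_shared)
{
        int ret;

        /* 1) No new EPT/NPT entries can be installed for the range
         *    from here until invalidate_end(). */
        invalidate_begin(gmem, start, end);

        /* 2) private->shared: zap the secure EPT/NPT entries;
         *    shared->private: unmap the host mappings, which also
         *    zaps the non-secure EPT/NPT entries. */
        ret = unmap_for_conversion(gmem, start, end, to_shared);

        /* 3) Flip shareability, then split (private->shared) or
         *    merge (shared->private) the backing folios. */
        if (!ret)
                ret = update_shareability(gmem, start, end, to_shared);

        invalidate_end(gmem, start, end);
        return ret;
}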

============================

For the IOMMU, could something like the below work?

* A new UAPI to bind IOMMU FDs with guest_memfd ranges
* VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
guest_memfd ranges using kvm_gmem_get_pfn()
    -> KVM invokes kvm_gmem_is_private() to check the range's
shareability; the IOMMU could use the same, or we could add an API in
gmem that takes the access type and checks the shareability before
returning the pfn (sketched below).
* The IOMMU stack exposes an invalidation callback that can be invoked
by guest_memfd.
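
Just to make that a bit more concrete, something along these lines
(everything here is invented for illustration, none of it is existing
UAPI or kernel API):

/*
 * Userspace: bind an IOMMU FD to a guest_memfd range so the kernel
 * can resolve DMA offsets to gmem pfns without GUP. Purely an
 * illustrative layout.
 */
struct iommu_gmem_bind {
        __u32   argsz;
        __u32   flags;
        __s32   gmem_fd;
        __u64   gmem_offset;    /* offset into the guest_memfd */
        __u64   iova;
        __u64   size;
};

/*
 * Kernel: a gmem helper that checks shareability against the
 * requested access before handing out the pfn; the IOMMU map path
 * would call this instead of pinning pages.
 */
enum gmem_access {
        GMEM_ACCESS_SHARED,     /* non-secure IOMMU mappings */
        GMEM_ACCESS_PRIVATE,    /* secure IOMMU mappings */
};

int gmem_get_pfn_for_access(struct file *gmem_file, pgoff_t index,
                            enum gmem_access access,
                            unsigned long *pfn, int *max_order);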

Private to Shared conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
    2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
actually unmaps the KVM SEPT/NPT entries.
           -> guest_memfd invokes the IOMMU invalidation callback to
zap the secure IOMMU entries (see the callback sketch below).
    3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
shareability and then splits the folios if needed.
    4) Userspace invokes the IOMMU map operation to map the ranges in
the non-secure IOMMU.

Shared to private conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
    2) guest_memfd invokes kvm_gmem_convert_should_proceed(), which
actually unmaps the host mappings, which in turn unmaps the KVM
non-secure EPT/NPT entries.
         -> guest_memfd invokes the IOMMU invalidation callback to zap
the non-secure IOMMU entries.
    3) guest_memfd invokes kvm_gmem_execute_work(), which updates the
shareability and then merges the folios if needed.
    4) Userspace invokes the IOMMU map operation to map the ranges in
the secure IOMMU.
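
The IOMMU-side invalidation callback invoked in step 2 of both flows
could then be fairly simple; again, all names are invented and the
iommufd/VFIO plumbing is hand-waved:

static void iommu_gmem_invalidate_begin(void *priv, pgoff_t start,
                                        pgoff_t end)
{
        struct iommu_gmem_binding *bind = priv;  /* hypothetical */

        /*
         * Zap whatever IOMMU entries currently cover the converted
         * range: the secure ones for private->shared, the non-secure
         * ones for shared->private. Userspace then re-establishes the
         * mapping on the other side via the map operation (step 4).
         */
        iommu_gmem_unmap_range(bind, start, end);   /* placeholder */
        iommu_gmem_flush_iotlb(bind);               /* placeholder */
}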

There should be a way to block external IOMMU page table updates while
guest_memfd is performing a conversion, e.g. something like
kvm_invalidate_begin()/end().
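
One way to express that, borrowing the shape of KVM's
mmu_invalidate_seq/in_progress retry logic (the names below are made
up):

struct gmem_inval_state {
        spinlock_t      lock;
        unsigned long   seq;            /* bumped by invalidate_end() */
        int             in_progress;    /* non-zero between begin/end */
};

/*
 * IOMMU map path: snapshot the sequence count, translate the pfn via
 * gmem, then recheck before committing the IOMMU page table entry;
 * back off and retry if a conversion began or completed in between.
 */
static bool gmem_map_needs_retry(struct gmem_inval_state *s,
                                 unsigned long snapshot)
{
        bool retry;

        spin_lock(&s->lock);
        retry = s->in_progress || s->seq != snapshot;
        spin_unlock(&s->lock);
        return retry;
}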

> > is going to be flushed on a page conversion anyway (the RMPUPDATE
> > instruction does that). All this is about AMD's x86 though.
>
> The iommu should not be using the VMA to manage the mapping. It should

+1.

> be directly linked to the guestmemfd in some way that does not disturb
> its operations. I imagine there would be some kind of invalidation
> callback directly to the iommu.
>
> Presumably that invalidation call back can include a reason for the
> invalidation (addr change, shared/private conversion, etc)
>
> I'm not sure how we will figure out which case is which but guestmemfd
> should allow the iommu to plug in either invalidation scheme..
>
> Probably invalidation should be a global to the FD thing, I imagine
> that once invalidation is established the iommu will not be
> incrementing page refcounts.

+1.

>
> Jason
