Message-ID: <31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com>
Date: Fri, 27 Jun 2025 14:49:43 +1000
From: Alexey Kardashevskiy <aik@....com>
To: Vishal Annapurve <vannapurve@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: Fuad Tabba <tabba@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>,
 kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 x86@...nel.org, linux-fsdevel@...r.kernel.org, ajones@...tanamicro.com,
 akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com,
 anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com,
 binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com,
 chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com,
 david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk,
 erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com,
 haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com,
 ira.weiny@...el.com, isaku.yamahata@...el.com, jack@...e.cz,
 james.morse@....com, jarkko@...nel.org, jgowans@...zon.com,
 jhubbard@...dia.com, jroedel@...e.de, jthoughton@...gle.com,
 jun.miao@...el.com, kai.huang@...el.com, keirf@...gle.com,
 kent.overstreet@...ux.dev, kirill.shutemov@...el.com,
 liam.merwick@...cle.com, maciej.wieczor-retman@...el.com,
 mail@...iej.szmigiero.name, maz@...nel.org, mic@...ikod.net,
 michael.roth@....com, mpe@...erman.id.au, muchun.song@...ux.dev,
 nikunj@....com, nsaenz@...zon.es, oliver.upton@...ux.dev,
 palmer@...belt.com, pankaj.gupta@....com, paul.walmsley@...ive.com,
 pbonzini@...hat.com, pdurrant@...zon.co.uk, peterx@...hat.com,
 pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
 quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
 quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com,
 quic_pheragu@...cinc.com, quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com,
 richard.weiyang@...il.com, rick.p.edgecombe@...el.com, rientjes@...gle.com,
 roypat@...zon.co.uk, rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org,
 steven.price@....com, steven.sistare@...cle.com, suzuki.poulose@....com,
 thomas.lendacky@....com, usama.arif@...edance.com, vbabka@...e.cz,
 viro@...iv.linux.org.uk, vkuznets@...hat.com, wei.w.wang@...el.com,
 will@...nel.org, willy@...radead.org, xiaoyao.li@...el.com,
 yan.y.zhao@...el.com, yilun.xu@...el.com, yuzenghui@...wei.com,
 zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
 KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls



On 25/6/25 00:10, Vishal Annapurve wrote:
> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@...pe.ca> wrote:
>>
>> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>>
>>> Now, I am rebasing my RFC on top of this patchset and it fails in
>>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
>>> folios in my RFC.
>>>
>>> So what is the expected sequence here? The userspace unmaps a DMA
>>> page and maps it back right away, all from the userspace? The end
>>> result will be exactly the same, which seems useless. And the IOMMU TLB
> 
>   As Jason described, ideally the IOMMU, just like KVM, should:
> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> by IOMMU stack
> 2) Directly query pfns from guest_memfd for both shared/private ranges
> 3) Implement an invalidation callback that guest_memfd can invoke on
> conversions.
> 
> Current flow:
> Private to Shared conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>           -> KVM has the concept of invalidation_begin() and end(),
> which effectively ensures that between these function calls, no new
> EPT/NPT entries can be added for the range.
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the KVM SEPT/NPT entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then splits the folios if needed
> 
> Shared to private conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the host mappings, which will unmap the KVM non-secure
> EPT/NPT entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then merges the folios if needed.
> 
> ============================
> 
> For IOMMU, could something like below work?
> 
> * A new UAPI to bind IOMMU FDs with guest_memfd ranges

Done that.
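
For illustration only (not the actual uAPI from the RFC, just the shape of the
idea): an iommufd-style command that takes an fd + offset instead of a user VA,
along the lines of the existing struct iommu_ioas_map:

struct iommu_ioas_map_gmem {
	__u32 size;
	__u32 flags;
	__u32 ioas_id;
	__s32 gmem_fd;			/* the guest_memfd to map from */
	__aligned_u64 gmem_offset;	/* offset into the guest_memfd */
	__aligned_u64 length;
	__aligned_u64 iova;
};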

> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> guest_memfd ranges using kvm_gmem_get_pfn()

This API imho should drop the confusing kvm_ prefix.
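
Something keyed by the gmem file + pgoff rather than kvm/gfn would do; a
hypothetical prototype just to illustrate:

/* Hypothetical kvm-free flavour of the pfn lookup */
int guest_memfd_get_pfn(struct file *file, pgoff_t index,
			unsigned long *pfn, int *max_order);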

>      -> kvm invokes kvm_gmem_is_private() to check for the range
> shareability, IOMMU could use the same or we could add an API in gmem
> that takes in access type and checks the shareability before returning
> the pfn.

Right now I have cut-and-pasted kvm_gmem_get_folio() (which is essentially filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. Such a link is probably unavoidable though.
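
Roughly like this, a minimal sketch with a made-up name, built only on generic
filemap calls so it does not pull in kvm.ko (error handling trimmed, a real
version would also retry on a racing insertion):

#include <linux/pagemap.h>

static struct folio *gmem_get_folio(struct address_space *mapping, pgoff_t index)
{
	struct folio *folio;
	int ret;

	/* Fast path: the folio is already in the page cache, return it locked */
	folio = filemap_lock_folio(mapping, index);
	if (!IS_ERR(folio))
		return folio;

	/* Slow path: allocate a fresh folio and insert it */
	folio = filemap_alloc_folio(GFP_HIGHUSER, 0);
	if (!folio)
		return ERR_PTR(-ENOMEM);

	ret = filemap_add_folio(mapping, folio, index, GFP_HIGHUSER);
	if (ret) {
		folio_put(folio);
		return ERR_PTR(ret);
	}

	return folio;	/* locked, like kvm_gmem_get_folio() */
}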


> * IOMMU stack exposes an invalidation callback that can be invoked by
> guest_memfd.
> 
> Private to Shared conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the KVM SEPT/NPT entries.
>             -> guest_memfd invokes IOMMU invalidation callback to zap
> the secure IOMMU entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then splits the folios if needed
>       4) Userspace invokes IOMMU map operation to map the ranges in
> non-secure IOMMU.
> 
> Shared to private conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the host mappings, which will unmap the KVM non-secure
> EPT/NPT entries.
>           -> guest_memfd invokes IOMMU invalidation callback to zap the
> non-secure IOMMU entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then merges the folios if needed.
>       4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.


Alright (although this zap+map is not necessary on the AMD hw).
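
To make the callback in step 2) concrete, a rough sketch (all names made up) of
how guest_memfd could keep per-file IOMMU bindings and kick them from the
conversion path, between invalidate_begin() and end():

#include <linux/list.h>
#include <linux/pagemap.h>

struct gmem_iommu_binding;

struct gmem_iommu_binding_ops {
	/* Drop the (secure or non-secure) IOMMU mappings for [start, end) */
	void (*invalidate)(struct gmem_iommu_binding *b,
			   pgoff_t start, pgoff_t end);
};

struct gmem_iommu_binding {
	struct list_head list;	/* entry on the gmem file's binding list */
	const struct gmem_iommu_binding_ops *ops;
};

static void gmem_invalidate_iommu_bindings(struct list_head *bindings,
					   pgoff_t start, pgoff_t end)
{
	struct gmem_iommu_binding *b;

	list_for_each_entry(b, bindings, list)
		b->ops->invalidate(b, start, end);
}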


> There should be a way to block external IOMMU pagetable updates while
> guest_memfd is performing a conversion, e.g. something like
> kvm_invalidate_begin()/end().
> 
>>> is going to be flushed on a page conversion anyway (the RMPUPDATE
>>> instruction does that). All this is about AMD's x86 though.
>>
>> The iommu should not be using the VMA to manage the mapping. It should
> 
> +1.

Yeah, I am already not doing this, because I physically cannot map gmemfd's memory in the IOMMU via the VMA (that path allocates memory via gup(), so the wrong memory ends up mapped in the IOMMU). Thanks,


>> be directly linked to the guestmemfd in some way that does not disturb
>> its operations. I imagine there would be some kind of invalidation
>> callback directly to the iommu.
>>
>> Presumably that invalidation call back can include a reason for the
>> invalidation (addr change, shared/private conversion, etc)
>>
>> I'm not sure how we will figure out which case is which but guestmemfd
>> should allow the iommu to plug in either invalidation scheme..
>>
>> Probably invalidation should be a global to the FD thing, I imagine
>> that once invalidation is established the iommu will not be
>> incrementing page refcounts.
> 
> +1.

Alright. Thanks for the comments.
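
For the FD-global part with a reason, something along these lines (made-up
names again; once registered, the IOMMU side would take no folio refcounts and
simply re-look-up pfns after an invalidation):

#include <linux/fs.h>
#include <linux/pagemap.h>

enum gmem_invalidate_reason {
	GMEM_INVAL_CONVERT_TO_PRIVATE,	/* shared -> private conversion */
	GMEM_INVAL_CONVERT_TO_SHARED,	/* private -> shared conversion */
	GMEM_INVAL_TRUNCATE,		/* punch hole, range going away */
};

/* Hypothetical guest_memfd exports, one registration per gmem file */
int guest_memfd_register_invalidate(struct file *gmem_file,
				    void (*invalidate)(void *priv,
						       pgoff_t start, pgoff_t end,
						       enum gmem_invalidate_reason r),
				    void *priv);
void guest_memfd_unregister_invalidate(struct file *gmem_file, void *priv);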

> 
>>
>> Jason

-- 
Alexey

