Open Source and information security mailing list archives
 
Message-ID: <8f04f1df-d68d-4ef8-b176-595bbf00a9d1@amd.com>
Date: Mon, 30 Jun 2025 10:19:06 +1000
From: Alexey Kardashevskiy <aik@....com>
To: Vishal Annapurve <vannapurve@...gle.com>
Cc: Jason Gunthorpe <jgg@...pe.ca>, Fuad Tabba <tabba@...gle.com>,
 Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, x86@...nel.org,
 linux-fsdevel@...r.kernel.org, ajones@...tanamicro.com,
 akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com,
 anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com,
 binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com,
 chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com,
 david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk,
 erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com,
 haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com,
 ira.weiny@...el.com, isaku.yamahata@...el.com, jack@...e.cz,
 james.morse@....com, jarkko@...nel.org, jgowans@...zon.com,
 jhubbard@...dia.com, jroedel@...e.de, jthoughton@...gle.com,
 jun.miao@...el.com, kai.huang@...el.com, keirf@...gle.com,
 kent.overstreet@...ux.dev, kirill.shutemov@...el.com,
 liam.merwick@...cle.com, maciej.wieczor-retman@...el.com,
 mail@...iej.szmigiero.name, maz@...nel.org, mic@...ikod.net,
 michael.roth@....com, mpe@...erman.id.au, muchun.song@...ux.dev,
 nikunj@....com, nsaenz@...zon.es, oliver.upton@...ux.dev,
 palmer@...belt.com, pankaj.gupta@....com, paul.walmsley@...ive.com,
 pbonzini@...hat.com, pdurrant@...zon.co.uk, peterx@...hat.com,
 pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
 quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
 quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com,
 quic_pheragu@...cinc.com, quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com,
 richard.weiyang@...il.com, rick.p.edgecombe@...el.com, rientjes@...gle.com,
 roypat@...zon.co.uk, rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org,
 steven.price@....com, steven.sistare@...cle.com, suzuki.poulose@....com,
 thomas.lendacky@....com, usama.arif@...edance.com, vbabka@...e.cz,
 viro@...iv.linux.org.uk, vkuznets@...hat.com, wei.w.wang@...el.com,
 will@...nel.org, willy@...radead.org, xiaoyao.li@...el.com,
 yan.y.zhao@...el.com, yilun.xu@...el.com, yuzenghui@...wei.com,
 zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
 KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls



On 28/6/25 01:17, Vishal Annapurve wrote:
> On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy <aik@....com> wrote:
>>
>>
>>
>> On 25/6/25 00:10, Vishal Annapurve wrote:
>>> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@...pe.ca> wrote:
>>>>
>>>> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>>>>
>>>>> Now, I am rebasing my RFC on top of this patchset and it fails in
>>>>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
>>>>> folios in my RFC.
>>>>>
>>>>> So what is the expected sequence here? The userspace unmaps a DMA
>>>>> page and maps it back right away, all from userspace? The end
>>>>> result will be exactly the same, which seems useless. And the IOMMU TLB
>>>
>>>    As Jason described, ideally the IOMMU, just like KVM, should:
>>> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
>>> by IOMMU stack
>>> 2) Directly query pfns from guest_memfd for both shared/private ranges
>>> 3) Implement an invalidation callback that guest_memfd can invoke on
>>> conversions.
> 
> Conversions and truncations both.
> 
>>>
>>> Current flow:
>>> Private to Shared conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>            -> KVM has the concept of invalidation_begin() and end(),
>>> which effectively ensures that between these function calls, no new
>>> EPT/NPT entries can be added for the range.
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the KVM SEPT/NPT entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then splits the folios if needed
>>>
>>> Shared to private conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the host mappings, which will unmap the KVM non-secure
>>> EPT/NPT entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then merges the folios if needed.
>>>
>>> ============================
>>>
>>> For IOMMU, could something like below work?
>>>
>>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>>
>> Done that.
>>
>>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
>>> guest_memfd ranges using kvm_gmem_get_pfn()
>>
>> This API imho should drop the confusing kvm_ prefix.
>>
>>>       -> kvm invokes kvm_gmem_is_private() to check for the range
>>> shareability, IOMMU could use the same or we could add an API in gmem
>>> that takes in access type and checks the shareability before returning
>>> the pfn.
>>
>> Right now I cut-and-pasted kvm_gmem_get_folio() (which is essentially filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. That link is probably unavoidable, though.
> 
> I don't think that's the way to avoid links between iommufd.ko and
> kvm.ko. A cleaner way is probably to have the gmem logic built in and
> allow runtime registration of invalidation callbacks from the KVM/IOMMU
> backends. Need to think about this more.

Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd (== kvm.ko) at run time, which means more of the beloved symbol_get() :)

> 
>>
>>
>>> * IOMMU stack exposes an invalidation callback that can be invoked by
>>> guest_memfd.
>>>
>>> Private to Shared conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the KVM SEPT/NPT entries.
>>>              -> guest_memfd invokes IOMMU invalidation callback to zap
>>> the secure IOMMU entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then splits the folios if needed
>>>        4) Userspace invokes IOMMU map operation to map the ranges in
>>> non-secure IOMMU.
>>>
>>> Shared to private conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the host mappings, which will unmap the KVM non-secure
>>> EPT/NPT entries.
>>>            -> guest_memfd invokes IOMMU invalidation callback to zap the
>>> non-secure IOMMU entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then merges the folios if needed.
>>>        4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>>
>>
>> Alright (although this zap+map is not necessary on the AMD hw).
> 
> IMO guest_memfd ideally should not directly interact or cater to arch
> specific needs, it should implement a mechanism that works for all
> archs. KVM/IOMMU implement invalidation callbacks and have all the
> architecture specific knowledge to take the right decisions.


Every page conversion will go through:

kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (built-in).

Which one decides that the IOMMU does not need (un)mapping? It has got to be (1), but then it needs to propagate that decision to amd-iommu (and we do not have (3) in that path at the moment).

Or do we just always do unmap+map (and trigger unwanted huge page smashing)? All of this is doable and none of it is particularly horrible; I am trying to see where the consensus is now. Thanks,


>>
>>> There should be a way to block external IOMMU pagetable updates while
>>> guest_memfd is performing conversion e.g. something like
>>> kvm_invalidate_begin()/end().
>>>
>>>>> is going to be flushed on a page conversion anyway (the RMPUPDATE
>>>>> instruction does that). All this is about AMD's x86 though.
>>>>
>>>> The iommu should not be using the VMA to manage the mapping. It should
>>>
>>> +1.
>>
>> Yeah, I am already not doing this, because I physically cannot map gmemfd's memory into the IOMMU via a VMA (the VMA path allocates memory via gup(), so the wrong memory ends up mapped in the IOMMU). Thanks,
>>
>>
>>>> be directly linked to the guestmemfd in some way that does not disturb
>>>> its operations. I imagine there would be some kind of invalidation
>>>> callback directly to the iommu.
>>>>
>>>> Presumably that invalidation call back can include a reason for the
>>>> invalidation (addr change, shared/private conversion, etc)
>>>>
>>>> I'm not sure how we will figure out which case is which but guestmemfd
>>>> should allow the iommu to plug in either invalidation scheme..
>>>>
>>>> Probably invalidation should be a global to the FD thing, I imagine
>>>> that once invalidation is established the iommu will not be
>>>> incrementing page refcounts.
>>>
>>> +1.
>>
>> Alright. Thanks for the comments.
>>
>>>
>>>>
>>>> Jason
>>
>> --
>> Alexey
>>

-- 
Alexey

