lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <09db374e-fa7d-4c1d-bf03-aaaafd93bd01@amd.com>
Date: Thu, 10 Jul 2025 16:57:25 +1000
From: Alexey Kardashevskiy <aik@....com>
To: Vishal Annapurve <vannapurve@...gle.com>
Cc: Jason Gunthorpe <jgg@...pe.ca>, Fuad Tabba <tabba@...gle.com>,
 Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, x86@...nel.org,
 linux-fsdevel@...r.kernel.org, ajones@...tanamicro.com,
 akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com,
 anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com,
 binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com,
 chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com,
 david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk,
 erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com,
 haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com,
 ira.weiny@...el.com, isaku.yamahata@...el.com, jack@...e.cz,
 james.morse@....com, jarkko@...nel.org, jgowans@...zon.com,
 jhubbard@...dia.com, jroedel@...e.de, jthoughton@...gle.com,
 jun.miao@...el.com, kai.huang@...el.com, keirf@...gle.com,
 kent.overstreet@...ux.dev, kirill.shutemov@...el.com,
 liam.merwick@...cle.com, maciej.wieczor-retman@...el.com,
 mail@...iej.szmigiero.name, maz@...nel.org, mic@...ikod.net,
 michael.roth@....com, mpe@...erman.id.au, muchun.song@...ux.dev,
 nikunj@....com, nsaenz@...zon.es, oliver.upton@...ux.dev,
 palmer@...belt.com, pankaj.gupta@....com, paul.walmsley@...ive.com,
 pbonzini@...hat.com, pdurrant@...zon.co.uk, peterx@...hat.com,
 pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
 quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
 quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com,
 quic_pheragu@...cinc.com, quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com,
 richard.weiyang@...il.com, rick.p.edgecombe@...el.com, rientjes@...gle.com,
 roypat@...zon.co.uk, rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org,
 steven.price@....com, steven.sistare@...cle.com, suzuki.poulose@....com,
 thomas.lendacky@....com, usama.arif@...edance.com, vbabka@...e.cz,
 viro@...iv.linux.org.uk, vkuznets@...hat.com, wei.w.wang@...el.com,
 will@...nel.org, willy@...radead.org, xiaoyao.li@...el.com,
 yan.y.zhao@...el.com, yilun.xu@...el.com, yuzenghui@...wei.com,
 zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
 KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls



On 1/7/25 00:19, Vishal Annapurve wrote:
> On Sun, Jun 29, 2025 at 5:19 PM Alexey Kardashevskiy <aik@....com> wrote:
>> ...
>>>>> ============================
>>>>>
>>>>> For IOMMU, could something like below work?
>>>>>
>>>>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>>>>
>>>> Done that.
>>>>
>>>>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
>>>>> guest_memfd ranges using kvm_gmem_get_pfn()
>>>>
>>>> This API imho should drop the confusing kvm_ prefix.
>>>>
>>>>>        -> kvm invokes kvm_gmem_is_private() to check for the range
>>>>> shareability, IOMMU could use the same or we could add an API in gmem
>>>>> that takes in access type and checks the shareability before returning
>>>>> the pfn.
>>>>
>>>> Right now I cutnpasted kvm_gmem_get_folio() (which essentially is filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable though.
>>>
>>> I don't think that's the way to avoid links between iommufd.ko and
>>> kvm.ko. Cleaner way probably is to have gmem logic built-in and allow
>>> runtime registration of invalidation callbacks from KVM/IOMMU
>>> backends. Need to think about this more.
>>
>> Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd (==kvm.ko) in run time so more beloved symbol_get() :)
>>
>>>
>>>>
>>>>
>>>>> * IOMMU stack exposes an invalidation callback that can be invoked by
>>>>> guest_memfd.
>>>>>
>>>>> Private to Shared conversion via kvm_gmem_convert_range() -
>>>>>        1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>>>> on each bound memslot overlapping with the range
>>>>>         2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>>>> actually unmaps the KVM SEPT/NPT entries.
>>>>>               -> guest_memfd invokes IOMMU invalidation callback to zap
>>>>> the secure IOMMU entries.
>>>>>         3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>>>> shareability and then splits the folios if needed
>>>>>         4) Userspace invokes IOMMU map operation to map the ranges in
>>>>> non-secure IOMMU.
>>>>>
>>>>> Shared to private conversion via kvm_gmem_convert_range() -
>>>>>        1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>>>> on each bound memslot overlapping with the range
>>>>>         2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>>>> actually unmaps the host mappings which will unmap the KVM non-seucure
>>>>> EPT/NPT entries.
>>>>>             -> guest_memfd invokes IOMMU invalidation callback to zap the
>>>>> non-secure IOMMU entries.
>>>>>         3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>>>> shareability and then merges the folios if needed.
>>>>>         4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>>>>
>>>>
>>>> Alright (although this zap+map is not necessary on the AMD hw).
>>>
>>> IMO guest_memfd ideally should not directly interact or cater to arch
>>> specific needs, it should implement a mechanism that works for all
>>> archs. KVM/IOMMU implement invalidation callbacks and have all the
>>> architecture specific knowledge to take the right decisions.
>>
>>
>> Every page conversion will go through:
>>
>> kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (build-in).
>>
>> Which one decides on IOMMU not needing (un)mapping? Got to be (1) but then it need to propagate the decision to amd-iommu (and we do not have (3) at the moment in that path).
> 
> If there is a need, guest_memfd can support two different callbacks:
> 1) Conversion notifier/callback invoked by guest_memfd during
> conversion handling.
> 2) Invalidation notifier/callback invoked by guest_memfd during truncation.
> 
> Iommufd/kvm can handle conversion callback/notifier as per the needs
> of underlying architecture. e.g. for TDX connect do the unmapping vs
> for SEV Trusted IO skip the unmapping.
> 
> Invalidation callback/notifier will need to be handled by unmapping page tables.
> 
>>
>> Or we just always do unmap+map (and trigger unwanted page huge page smashing)? All is doable and neither particularly horrible, I'm trying to see where the consensus is now. Thanks,
>>
> 
> I assume when you say huge page smashing, it means huge page NPT
> mapping getting split.
> 
> AFAIR, based on discussion with Michael during guest_memfd calls,
> stage2 NPT entries need to be of the same granularity as RMP tables
> for AMD SNP guests. i.e. huge page NPT mappings need to be smashed on
> the KVM side during conversion. So today guest_memfd sends
> invalidation notification to KVM for both conversion and truncation.
> Doesn't the same constraint for keeping IOMMU page tables at the same
> granularity as RMP tables hold for trusted IO?


Currently I handle this from the KVM with a hack to get IOPDE from AMD IOMMU so both 2MB RMP entry and IOPDE entries are smashed in one go in one of many firmwares running on EPYC, and atm this is too hacky to be posted even as an RFC. This likely needs to move to IOMMUFD then (via some callbacks) which could call AMD IOMMU which then would call that firmware (called "TMPM" and it is not the PSP which is "TSM), probably. Thanks,



-- 
Alexey


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ