Message-ID: <CAGtprH_7jSpwF77j1GW8rjSrbtZZ2OW2iGck5=Wk67+VnF9vjQ@mail.gmail.com>
Date: Tue, 20 May 2025 09:02:50 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Fuad Tabba <tabba@...gle.com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, x86@...nel.org, linux-fsdevel@...r.kernel.org,
aik@....com, ajones@...tanamicro.com, akpm@...ux-foundation.org,
amoorthy@...gle.com, anthony.yznaga@...cle.com, anup@...infault.org,
aou@...s.berkeley.edu, bfoster@...hat.com, binbin.wu@...ux.intel.com,
brauner@...nel.org, catalin.marinas@....com, chao.p.peng@...el.com,
chenhuacai@...nel.org, dave.hansen@...el.com, david@...hat.com,
dmatlack@...gle.com, dwmw@...zon.co.uk, erdemaktas@...gle.com,
fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, haibo1.xu@...el.com,
hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com,
jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com,
jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com,
kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev,
kirill.shutemov@...el.com, liam.merwick@...cle.com,
maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org,
mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au,
muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es,
oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com,
paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk,
peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com,
quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com,
rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk,
rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com,
steven.sistare@...cle.com, suzuki.poulose@....com, thomas.lendacky@....com,
usama.arif@...edance.com, vbabka@...e.cz, viro@...iv.linux.org.uk,
vkuznets@...hat.com, wei.w.wang@...el.com, will@...nel.org,
willy@...radead.org, xiaoyao.li@...el.com, yan.y.zhao@...el.com,
yilun.xu@...el.com, yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
On Tue, May 20, 2025 at 7:34 AM Fuad Tabba <tabba@...gle.com> wrote:
>
> Hi Vishal,
>
> On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@...gle.com> wrote:
> >
> > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@...gle.com> wrote:
> > >
> > > Hi Vishal,
> > >
> > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@...gle.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@...gle.com> wrote:
> > > > >
> > > > > Hi Ackerley,
> > > > >
> > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@...gle.com> wrote:
> > > > > >
> > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > > > and private respectively.
> > > > >
> > > > > I have a high level question about this particular patch and this
> > > > > approach for conversion: why do we need IOCTLs to manage conversion
> > > > > between private and shared?
> > > > >
> > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > > > series that performs in-place conversion [3] and the associated (by
> > > > > now outdated) state diagram [4], I didn't see the need to have a
> > > > > userspace-facing interface to manage that. KVM has all the information
> > > > > it needs to handle conversions, which are triggered by the guest. To
> > > > > me this seems like it adds additional complexity, as well as a user
> > > > > facing interface that we would need to maintain.
> > > > >
> > > > > There are various ways we could handle conversion without explicit
> > > > > interference from userspace. What I had in mind is the following (as
> > > > > an example; details can vary according to VM type). I will use the
> > > > > case of conversion from shared to private because that is the more
> > > > > complicated (interesting) case:
> > > > >
> > > > > - Guest issues a hypercall to request that a shared folio become private.
> > > > >
> > > > > - The hypervisor receives the call, and passes it to KVM.
> > > > >
> > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > > > parlance), and unmaps it from the host. The host however, could still
> > > > > have references (e.g., GUP).
> > > > >
> > > > > - KVM exits to the host (hypervisor call exit), with the information
> > > > > that the folio has been unshared from it.
> > > > >
> > > > > - A well behaving host would now get rid of all of its references
> > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > > > running as normal. I expect this to be the common case.
> > > > >
> > > > > But to handle the more interesting situation, let's say that the host
> > > > > doesn't do it immediately, and for some reason it holds on to some
> > > > > references to that folio.
> > > > >
> > > > > - Even if that's the case, the guest can still run *. If the guest
> > > > > tries to access the folio, KVM detects that access when it tries to
> > > > > fault it into the guest, sees that the host still has references to
> > > > > that folio, and exits back to the host with a memory fault exit. At
> > > > > this point, the VCPU that has tried to fault in that particular folio
> > > > > cannot continue running as long as it cannot fault in that folio.
> > > >
> > > > Are you talking about the following scheme?
> > > > 1) guest_memfd checks shareability on each get_pfn and, if there
> > > > is a mismatch, exits to the host.
> > >
> > > I think we are not really on the same page here (no pun intended :) ).
> > > I'll try to answer your questions anyway...
> > >
> > > Which get_pfn? Are you referring to get_pfn when faulting the page
> > > into the guest or into the host?
> >
> > I am referring to guest fault handling in KVM.
> >
> > >
> > > > 2) host userspace has to guess whether it's a pending refcount or
> > > > whether it's an actual mismatch.
> > >
> > > No need to guess. VCPU run will let it know exactly why it's exiting.
> > >
> > > > 3) guest_memfd will maintain a third state
> > > > "pending_private_conversion" or equivalent which will transition to
> > > > private upon the last refcount drop of each page.
> > > >
> > > > If conversion is triggered by userspace (in case of pKVM, it will be
> > > > triggered from within the KVM (?)):
> > >
> > > Why would conversion be triggered by userspace? As far as I know, it's
> > > the guest that triggers the conversion.
> > >
> > > > * Conversion will just fail if there are extra refcounts, and
> > > > userspace can try to get rid of those refcounts on the range while
> > > > it has enough context, without hitting any ambiguity with memory
> > > > fault exits.
> > > > * guest_memfd will not have to deal with this extra state from 3 above
> > > > and overall guest_memfd conversion handling becomes relatively
> > > > simpler.
> > >
> > > That's not really related. The extra state isn't necessary any more
> > > once we agreed in the previous discussion that we will retry instead.
> >
> > Who is *we* here? Which entity will retry conversion?
>
> Userspace will re-attempt the VCPU run.
Then KVM will have to keep track of the ranges that need conversion
across exits. I think it's cleaner to let userspace make the decision
and invoke conversion, without carrying additional state in KVM about
guest requests.
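To make the contrast concrete, here is a toy model of the ioctl-driven
flow (plain Python; all names, states, and the EBUSY behaviour are
hypothetical sketches, not actual guest_memfd code): conversion is an
explicit userspace call that simply fails while the range is still in
use, so guest_memfd needs neither a third "pending" state nor
cross-exit tracking in KVM.

```python
import errno

class GmemFile:
    """Toy model of per-offset guest_memfd shareability (names hypothetical)."""
    def __init__(self, npages):
        self.shareability = ["shared"] * npages  # only two states, no "pending"
        self.extra_refs = [0] * npages           # e.g. GUPs or IOMMU pins held by the host

    def convert_private(self, start, n):
        """Sketch of a KVM_GMEM_CONVERT_PRIVATE-style call: refuse in-use pages."""
        for i in range(start, start + n):
            if self.extra_refs[i]:
                return -errno.EBUSY              # fail cleanly; userspace keeps context
        for i in range(start, start + n):
            self.shareability[i] = "private"
        return 0

# Userspace flow: the guest's unshare request surfaces via a vcpu exit,
# userspace releases its references, then invokes the conversion call.
f = GmemFile(4)
f.extra_refs[1] = 1                              # host still holds a reference
assert f.convert_private(0, 4) == -errno.EBUSY   # refused; state unchanged
assert f.shareability[1] == "shared"
f.extra_refs[1] = 0                              # userspace drops the reference
assert f.convert_private(0, 4) == 0
assert all(s == "private" for s in f.shareability)
```

The point of the EBUSY-style failure is that the decision of what to do
next (drop references and retry, or abort) stays entirely with
userspace.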
>
> > >
> > > > Note that for x86 CoCo cases, memory conversion is already triggered
> > > > by userspace using a KVM ioctl; this series is proposing to use a
> > > > guest_memfd ioctl to do the same.
> > >
> > > The reason why for x86 CoCo cases conversion is already triggered by
> > > userspace using KVM ioctl is that it has to, since shared memory and
> > > private memory are two separate pages, and userspace needs to manage
> > > that. Sharing memory in place removes the need for that.
> >
> > Userspace still needs to clean up memory usage before conversion is
> > successful. e.g. remove IOMMU mappings for shared to private
> > conversion. I would think that memory conversion should not succeed
> > before all existing users let go of the guest_memfd pages for the
> > range being converted.
>
> Yes. Userspace will know that it needs to do that on the VCPU exit,
> which informs it of the guest's hypervisor request to unshare (convert
> from shared to private) the page.
>
> > In x86 CoCo use cases, userspace can also decide not to allow
> > conversion in scenarios where ranges are still under active use by
> > the host and the guest is erroneously trying to take away memory.
> > Both the SNP and TDX specs allow conversion to fail due to in-use
> > memory.
>
> How can the guest erroneously try to take away memory? If the guest
> sends a hypervisor request asking for a conversion of memory that
> doesn't belong to it, then I would expect the hypervisor to prevent
> that.
Marking a range as private effectively disallows the host from
accessing that range -> so it takes away memory.
>
> I don't see how having an IOCTL to trigger the conversion is needed to
> allow conversion failure. How is that different from userspace
> ignoring or delaying releasing all references it has for the
> conversion request?
>
> > >
> > > This series isn't using the same ioctl, it's introducing new ones to
> > > perform a task that as far as I can tell so far, KVM can handle by
> > > itself.
> >
> > I would like to understand this better. How will KVM handle the
> > conversion process for guest_memfd pages? Can you help walk an example
> > sequence for shared to private conversion specifically around
> > guest_memfd offset states?
>
> To make sure that we are discussing the same scenario: can you do the
> same as well, please --- walk me through an example sequence for shared
> to private conversion, specifically around guest_memfd offset states,
> with the IOCTLs involved?
>
> Here is an example that I have implemented and tested with pKVM. Note
> that there are alternatives; the flow below is architecture- or even
> VM-type-dependent. None of this is core KVM code, and the
> behaviour could vary.
>
>
> Assuming the folio is shared with the host:
>
> Guest sends unshare hypercall to the hypervisor
> Hypervisor forwards request to KVM (gmem) (having done due diligence)
> KVM (gmem) performs an unmap_folio(), exits to userspace with
For the x86 CoCo VM use cases I was talking about, userspace would like
to avoid unmap_mapping_range() on the range until it's safe to unshare
it.
> KVM_EXIT_UNSHARE and all the information about the folio being
> unshared
>
> Case 1:
> Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
> Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
> references, sets state to PRIVATE
>
> Case 2 (alternative 1):
> Userspace doesn't release its references
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, exits back to userspace with KVM_EXIT_UNSHARE
>
> Case 2 (alternative 2):
> Userspace doesn't release its references
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, unmaps folio from guest, but allows it to run (until it
> tries to fault in the folio)
> Guest tries to fault in folio that still has reference, KVM does not
> allow that (it sees that the folio is shared, and it doesn't fault in
> shared folios to confidential guests)
> KVM exits back to userspace with KVM_EXIT_UNSHARE
>
> As I mentioned, the alternatives above are _not_ set in core KVM code.
> They can vary by architecture or VM type, depending on the policy,
> support, etc..
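The retry loop in Case 2 (alternative 1) above can be sketched as a toy
simulation (plain Python; all names, exit reasons, and the
one-reference-per-exit behaviour are hypothetical, not pKVM code). Note
that in this model KVM does have to remember the pending unshare
request across exits until the conversion can complete:

```python
# Hypothetical exit reasons, modeled as simple tokens.
KVM_EXIT_UNSHARE, KVM_EXIT_OK = "unshare", "ok"

# Toy per-folio state: KVM carries the pending guest request across exits.
state = {"refs": 2, "pending_unshare": True, "shareability": "shared"}

def vcpu_run(state):
    """Sketch of KVM's re-check on each KVM_RUN."""
    if state["pending_unshare"]:
        if state["refs"]:                 # host still holds references
            return KVM_EXIT_UNSHARE       # exit again; userspace must drop them
        state["shareability"] = "private" # last reference gone: finish conversion
        state["pending_unshare"] = False
    return KVM_EXIT_OK

# Userspace loop: keep releasing references and re-running the vcpu.
exits = 0
while vcpu_run(state) == KVM_EXIT_UNSHARE:
    exits += 1
    state["refs"] -= 1                    # e.g. unpin one reference per exit
assert exits == 2 and state["shareability"] == "private"
```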
>
> Now for your example please on how this would work with IOCTLs :)
>
> Thanks,
> /fuad
>
> > >
> > > > - Allows not having to keep track of separate shared/private range
> > > > information in KVM.
> > >
> > > This patch series is already tracking shared/private range information in KVM.
> > >
> > > > - Simpler handling of the conversion process done per guest_memfd
> > > > rather than for full range.
> > > > - Userspace can handle the rollback as needed, simplifying error
> > > > handling in guest_memfd.
> > > > - guest_memfd is single source of truth and notifies the users of
> > > > shareability change.
> > > > - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > > > getting notifications from guest_memfd directly and will get notified
> > > > for invalidation upon shareability attribute updates.
> > >
> > > All of these can still be done without introducing a new ioctl.
> > >
> > > Cheers,
> > > /fuad