Message-ID: <CA+EHjTzMYSHKuxMJbpMx594RsL64aph1dWj06zx_01=ZuQU+Bg@mail.gmail.com>
Date: Fri, 23 May 2025 11:12:03 +0100
From: Fuad Tabba <tabba@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, x86@...nel.org,
linux-fsdevel@...r.kernel.org, aik@....com, ajones@...tanamicro.com,
akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com,
anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com,
binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com,
chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com,
david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk,
erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com,
haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com,
jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com,
jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com,
kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev,
kirill.shutemov@...el.com, liam.merwick@...cle.com,
maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org,
mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au,
muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es,
oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com,
paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk,
peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com,
quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com,
rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk,
rppt@...nel.org, shuah@...nel.org, steven.price@....com,
steven.sistare@...cle.com, suzuki.poulose@....com, thomas.lendacky@....com,
usama.arif@...edance.com, vbabka@...e.cz, viro@...iv.linux.org.uk,
vkuznets@...hat.com, wei.w.wang@...el.com, will@...nel.org,
willy@...radead.org, xiaoyao.li@...el.com, yan.y.zhao@...el.com,
yilun.xu@...el.com, yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
Hi Sean,
On Thu, 22 May 2025 at 17:26, Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Thu, May 22, 2025, Fuad Tabba wrote:
> > On Thu, 22 May 2025 at 15:52, Sean Christopherson <seanjc@...gle.com> wrote:
> > > On Wed, May 21, 2025, Fuad Tabba wrote:
> > > > How does the host userspace find that out? If the host userspace is capable
> > > > of finding that out, then surely KVM is also capable of finding out the same.
> > >
> > > Nope, not on x86. Well, not without userspace invoking a new ioctl, which would
> > > defeat the purpose of adding these ioctls.
> > >
> > > KVM is only responsible for emulating/virtualizing the "CPU". The chipset, e.g.
> > > the PCI config space, is fully owned by userspace. KVM doesn't even know whether
> > > or not PCI exists for the VM. And reboot may be emulated by simply creating a
> > > new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> > > change in state would happen in an entirely new struct kvm.
> > >
> > > That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> > > front. The changelog asserts that:
> > >
> > > A guest_memfd ioctl is used because shareability is a property of the memory,
> > > and this property should be modifiable independently of the attached struct kvm
> > >
> > > but then follows with a very weak and IMO largely irrelevant justification of:
> > >
> > > This allows shareability to be modified even if the memory is not yet bound
> > > using memslots.
> > >
> > > Allowing userspace to change shareability without memslots is one relatively minor
> > > flow in one very specific use case.
> > >
> > > The real justification for these ioctls is that fundamentally, shareability for
> > > in-place conversions is a property of a guest_memfd instance and not a struct kvm
> > > instance, and so needs to be owned by guest_memfd.
> >
> > Thanks for the clarification Sean. I have a couple of followup
> > questions/comments that you might be able to help with:
> >
> > From a conceptual point of view, I understand that the in-place conversion is
> > a property of guest_memfd. But that doesn't necessarily mean that the
> > interface between kvm <-> guest_memfd is a userspace IOCTL.
>
> kvm and guest_memfd aren't the communication endpoints for in-place conversions,
> and more importantly, kvm isn't part of the control plane. kvm's primary role
> (for guest_memfd with in-place conversions) is to manage the page tables to map
> memory into the guest.
>
> kvm *may* also explicitly provide a communication channel between the guest and
> host, e.g. when conversions are initiated via hypercalls, but in some cases the
> communication channel may be created through pre-existing mechanisms, e.g. a
> shared memory buffer or emulated I/O (such as the PCI reset case).
>
> guest => kvm (dumb pipe) => userspace => guest_memfd => kvm (invalidate)
>
> And in other cases, kvm might not be in that part of the picture at all, e.g. if
> the userspace VMM provides an interface to the VM owner (which could also be the
> user running the VM) to reset the VM, then the flow would look like:
>
> userspace => guest_memfd => kvm (invalidate)
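>
> The two flows above can be modeled with a small sketch. This is a hypothetical
> Python model (the class and method names are illustrative, not the real KVM or
> guest_memfd interfaces); it only shows that userspace sits in the control plane
> and that guest_memfd, as owner of shareability, drives the kvm invalidation:

```python
# Hypothetical model of the conversion control plane; all names here
# are illustrative only, not the real KVM/guest_memfd interfaces.

class Kvm:
    """kvm's role: map memory into the guest, tear mappings down on
    invalidation callbacks from guest_memfd."""
    def __init__(self):
        self.mapped = set()

    def invalidate(self, gfn_range):
        # Called back by guest_memfd when shareability changes.
        self.mapped -= set(gfn_range)

class GuestMemfd:
    """guest_memfd owns the shareability state of its pages."""
    def __init__(self, kvm):
        self.kvm = kvm
        self.shared = set()

    def convert_to_shared(self, gfn_range):
        # The state change happens here; kvm is then told to invalidate.
        self.shared |= set(gfn_range)
        self.kvm.invalidate(gfn_range)

def policy_allows(request):
    # Placeholder policy; a real VMM applies its own rules here.
    return True

def userspace_handle_exit(gmem, request):
    """Userspace is the decision maker: it may honor or deny the guest's
    conversion request forwarded by kvm (the 'dumb pipe')."""
    if request["op"] == "share" and policy_allows(request):
        gmem.convert_to_shared(request["gfns"])
        return "converted"
    return "denied"

kvm = Kvm()
kvm.mapped = {0, 1, 2, 3}
gmem = GuestMemfd(kvm)

# guest => kvm (dumb pipe) => userspace => guest_memfd => kvm (invalidate)
result = userspace_handle_exit(gmem, {"op": "share", "gfns": [1, 2]})
print(result, sorted(gmem.shared), sorted(kvm.mapped))  # converted [1, 2] [0, 3]
```

> The second flow (userspace-initiated, e.g. VM reset) is the same model minus
> the guest request: userspace calls into guest_memfd directly and kvm only sees
> the invalidation.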
>
> A decent comparison is vCPUs. KVM _could_ route all ioctls through the VM, but
> that's unpleasant for all parties, as it'd be cumbersome for userspace, and
> unnecessarily complex and messy for KVM. Similarly, routing guest_memfd state
> changes through KVM_SET_MEMORY_ATTRIBUTES is awkward from both design and mechanical
> perspectives.
>
> Even if we disagree on how ugly/pretty routing conversions through kvm would be,
> which I'll allow is subjective, the bigger problem is that bouncing through
> KVM_SET_MEMORY_ATTRIBUTES would create an unholy mess of an ABI.
>
> Today, KVM_SET_MEMORY_ATTRIBUTES is handled entirely within kvm, and any changes
> take effect irrespective of any memslot bindings. And that didn't happen by
> chance; preserving and enforcing attribute changes independently of memslots was
> a key design requirement, precisely because memslots are ephemeral to a certain
> extent.
>
> Adding support for in-place guest_memfd conversion will require new ABI, and so
> will be a "breaking" change for KVM_SET_MEMORY_ATTRIBUTES no matter what. E.g.
> KVM will need to reject KVM_MEMORY_ATTRIBUTE_PRIVATE for VMs that elect to use
> in-place guest_memfd conversions. But very critically, KVM can crisply enumerate
> the lack of KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_CAP_MEMORY_ATTRIBUTES, the
> behavior will be very straightforward to document (e.g. CAP X is mutually exclusive
> with KVM_MEMORY_ATTRIBUTE_PRIVATE), and it will be opt-in, i.e. won't truly be a
> breaking change.
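>
> The opt-in mutual exclusion described above could be sketched as below. This is
> a hypothetical model (the Vm class, the opt-in flag, and the plumbing are made
> up for illustration; only the KVM_MEMORY_ATTRIBUTE_PRIVATE bit mirrors the real
> uapi), not the actual implementation:

```python
# Hypothetical sketch of the proposed ABI rule: VMs that opt in to
# in-place guest_memfd conversion lose KVM_MEMORY_ATTRIBUTE_PRIVATE.
# The attribute bit matches the KVM uapi; everything else is illustrative.

KVM_MEMORY_ATTRIBUTE_PRIVATE = 1 << 3  # real bit in include/uapi/linux/kvm.h

class Vm:
    def __init__(self, inplace_conversion=False):
        # The opt-in "CAP X" from the discussion: in-place gmem conversion.
        self.inplace_conversion = inplace_conversion

    def supported_attributes(self):
        """What KVM_CAP_MEMORY_ATTRIBUTES would enumerate for this VM."""
        attrs = KVM_MEMORY_ATTRIBUTE_PRIVATE
        if self.inplace_conversion:
            attrs &= ~KVM_MEMORY_ATTRIBUTE_PRIVATE
        return attrs

    def set_memory_attributes(self, attributes):
        """Model of KVM_SET_MEMORY_ATTRIBUTES: reject unsupported bits."""
        if attributes & ~self.supported_attributes():
            return -22  # -EINVAL
        return 0

legacy = Vm(inplace_conversion=False)
gmem_vm = Vm(inplace_conversion=True)

print(legacy.set_memory_attributes(KVM_MEMORY_ATTRIBUTE_PRIVATE))   # 0
print(gmem_vm.set_memory_attributes(KVM_MEMORY_ATTRIBUTE_PRIVATE))  # -22
```

> Because the capability enumeration changes with the opt-in, userspace can probe
> the supported attributes up front rather than discovering the restriction via
> an ioctl failure.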
>
> If/when we move shareability to guest_memfd, routing state changes through
> KVM_SET_MEMORY_ATTRIBUTES will gain a subtle dependency on userspace having to
> create memslots in order for state changes to take effect. That wrinkle would be
> weird and annoying to document, e.g. "if CAP X is enabled, the ioctl ordering is
> A => B => C, otherwise the ordering doesn't matter", and would create many more
> conundrums:
>
> - If a memslot needs to exist in order for KVM_SET_MEMORY_ATTRIBUTES to take effect,
> what should happen if that memslot is deleted?
> - If a memslot isn't found, should KVM_SET_MEMORY_ATTRIBUTES fail and report
> an error, or silently do nothing?
> - If KVM_SET_MEMORY_ATTRIBUTES affects multiple memslots that are bound to
> multiple guest_memfd, how does KVM guarantee atomicity? What happens if one
> guest_memfd conversion succeeds, but a later one fails?
>
> > We already communicate directly between the two. Other, even less related
> > subsystems within the kernel also interact without going through userspace.
> > Why can't we do the same here? I'm not suggesting it not be owned by
> > guest_memfd, but that we communicate directly.
>
> I'm not concerned about kvm communicating with guest_memfd, as you note it's all
> KVM. As above, my concerns are all about KVM's ABI and who owns/controls what.
>
> > From a performance point of view, I would expect the common case to be that
> > when KVM gets an unshare request from the guest, it would be able to unmap
> > those pages from the (cooperative) host userspace, and return back to the
> > guest. In this scenario, the host userspace wouldn't even need to be
> > involved.
>
> Hard NAK, at least from an x86 perspective. Userspace is the sole decision maker
> with respect to the shared vs. private state of memory, full stop. The guest
> can make *requests* to convert memory, but ultimately it's host userspace that
> decides whether or not to honor the request.
>
> We've litigated this exact issue multiple times. All state changes must be
> controlled by userspace, because userspace is the only entity that can gracefully
> handle exceptions and edge cases, and is the only entity with (almost) full
> knowledge of the system. We can discuss this again if necessary, but I'd much
> prefer to not rehash all of those conversations.
>
> > Having a userspace IOCTL as part of this makes that trip unnecessarily longer
> > for the common case.
>
> I'm very skeptical that an exit to userspace is going to even be measurable in
> terms of the cost to convert memory. Conversion is going to require multiple
> locks, modifications to multiple sets of page tables with all the associated TLB
> maintenance, possibly cache maintenance, and probably a few other things I'm
> forgetting. The cost of a few user<=>kernel transitions is likely going to be a
> drop in the bucket.
>
> If I'm wrong, and there are flows where the user<=>kernel transitions are the
> long pole, then we could certainly explore adding a way for userspace to opt
> into a "fast path" conversion. But it would need to be exactly that, an optional
> fast path that can fall back to the "slow" userspace-driven conversion as needed.
Thanks for this very thorough explanation. I know that we have
litigated this issue, but not this _exact_ issue. My understanding was
that the main reason for using IOCTLs for memory attributes is that
userspace needs to manage private and shared memory separately,
including allocation and punching holes where necessary.
That said, no need to discuss this again. If it turns out that
user<->kernel transitions are a bottleneck, we could look into an
opt-in fast path as you said.
Cheers,
/fuad