linux-kernel - Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aC9QPoEUw_nLHhV4@google.com>
Date: Thu, 22 May 2025 09:26:38 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Fuad Tabba <tabba@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, x86@...nel.org, 
	linux-fsdevel@...r.kernel.org, aik@....com, ajones@...tanamicro.com, 
	akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com, 
	anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com, 
	binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com, 
	chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com, 
	david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk, 
	erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, 
	haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com, 
	jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com, 
	jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com, 
	kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev, 
	kirill.shutemov@...el.com, liam.merwick@...cle.com, 
	maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org, 
	mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au, 
	muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es, 
	oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com, 
	paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk, 
	peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com, 
	quic_cvanscha@...cinc.com, quic_eberman@...cinc.com, 
	quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com, 
	quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com, 
	rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk, 
	rppt@...nel.org, shuah@...nel.org, steven.price@....com, 
	steven.sistare@...cle.com, suzuki.poulose@....com, thomas.lendacky@....com, 
	usama.arif@...edance.com, vbabka@...e.cz, viro@...iv.linux.org.uk, 
	vkuznets@...hat.com, wei.w.wang@...el.com, will@...nel.org, 
	willy@...radead.org, xiaoyao.li@...el.com, yan.y.zhao@...el.com, 
	yilun.xu@...el.com, yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
 KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls

On Thu, May 22, 2025, Fuad Tabba wrote:
> On Thu, 22 May 2025 at 15:52, Sean Christopherson <seanjc@...gle.com> wrote:
> > On Wed, May 21, 2025, Fuad Tabba wrote:
> > > How does the host userspace find that out? If the host userspace is capable
> > > of finding that out, then surely KVM is also capable of finding out the same.
> >
> > Nope, not on x86.  Well, not without userspace invoking a new ioctl, which would
> > defeat the purpose of adding these ioctls.
> >
> > KVM is only responsible for emulating/virtualizing the "CPU".  The chipset, e.g.
> > the PCI config space, is fully owned by userspace.  KVM doesn't even know whether
> > or not PCI exists for the VM.  And reboot may be emulated by simply creating a
> > new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> > change in state would happen in an entirely new struct kvm.
> >
> > That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> > front.  The changelog asserts that:
> >
> >   A guest_memfd ioctl is used because shareability is a property of the memory,
> >   and this property should be modifiable independently of the attached struct kvm
> >
> > but then follows with a very weak and IMO largely irrelevant justification of:
> >
> >   This allows shareability to be modified even if the memory is not yet bound
> >   using memslots.
> >
> > Allowing userspace to change shareability without memslots is one relatively minor
> > flow in one very specific use case.
> >
> > The real justification for these ioctls is that fundamentally, shareability for
> > in-place conversions is a property of a guest_memfd instance and not a struct kvm
> > instance, and so needs to owned by guest_memfd.
> 
> Thanks for the clarification Sean. I have a couple of followup
> questions/comments that you might be able to help with:
> 
> From a conceptual point of view, I understand that the in-place conversion is
> a property of guest_memfd. But that doesn't necessarily mean that the
> interface between kvm <-> guest_memfd is a userspace IOCTL.

kvm and guest_memfd aren't the communication endpoints for in-place conversions,
and more importantly, kvm isn't part of the control plane.  kvm's primary role
(for guest_memfd with in-place conversions) is to manage the page tables to map
memory into the guest.

kvm *may* also explicitly provide a communication channel between the guest and
host, e.g. when conversions are initiated via hypercalls, but in some cases the
communication channel may be created through pre-existing mechanisms, e.g. a
shared memory buffer or emulated I/O (such as the PCI reset case).

  guest => kvm (dumb pipe) => userspace => guest_memfd => kvm (invalidate)

And in other cases, kvm might not be in that part of the picture at all, e.g. if
the userspace VMM provides an interface to the VM owner (which could also be the
user running the VM) to reset the VM, then the flow would look like:

  userspace => guest_memfd => kvm (invalidate)

A decent comparison is vCPUs.  KVM _could_ route all ioctls through the VM, but
that's unpleasant for all parties, as it'd be cumbersome for userspace, and
unnecessarily complex and messy for KVM.  Similarly, routing guest_memfd state
changes through KVM_SET_MEMORY_ATTRIBUTES is awkward from both design and mechanical
perspectives.

Even if we disagree on how ugly/pretty routing conversions through kvm would be,
which I'll allow is subjective, the bigger problem is that bouncing through
KVM_SET_MEMORY_ATTRIBUTES would create an unholy mess of an ABI.

Today, KVM_SET_MEMORY_ATTRIBUTES is handled entirely within kvm, and any changes
take effect irrespective of any memslot bindings.  And that didn't happen by
chance; preserving and enforcing attribute changes independently of memslots was
a key design requirement, precisely because memslots are ephemeral to a certain
extent.

Adding support for in-place guest_memfd conversion will require new ABI, and so
will be a "breaking" change for KVM_SET_MEMORY_ATTRIBUTES no matter what.  E.g.
KVM will need to reject KVM_MEMORY_ATTRIBUTE_PRIVATE for VMs that elect to use
in-place guest_memfd conversions.  But very critically, KVM can cripsly enumerate
the lack of KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_CAP_MEMORY_ATTRIBUTES, the
behavior will be very straightforward to document (e.g. CAP X is mutually excusive
with KVM_MEMORY_ATTRIBUTE_PRIVATE), and it will be opt-in, i.e. won't truly be a
breaking change.

If/when we move shareability to guest_memfd, routing state changes through
KVM_SET_MEMORY_ATTRIBUTES will gain a subtle dependency on userspace having to
create memslots in order for state changes to take effect.  That wrinkle would be
weird and annoying to document, e.g. "if CAP X is enabled, the ioctl ordering is
A => B => C, otherwise the ordering doesn't matter", and would create many more
conundrums:

  - If a memslot needs to exist in order for KVM_SET_MEMORY_ATTRIBUTES to take effect,
    what should happen if that memslot is deleted?
  - If a memslot isn't found, should KVM_SET_MEMORY_ATTRIBUTES fail and report
    an error, or silently do nothing?
  - If KVM_SET_MEMORY_ATTRIBUTES affects multiple memslots that are bound to
    multiple guest_memfd, how does KVM guarantee atomicity?  What happens if one
    guest_memfd conversion succeeds, but a later fails?

> We already communicate directly between the two. Other, even less related
> subsystems within the kernel also interact without going through userspace.
> Why can't we do the same here? I'm not suggesting it not be owned by
> guest_memfd, but that we communicate directly.

I'm not concerned about kvm communicating with guest_memfd, as you note it's all
KVM.  As above, my concerns are all about KVM's ABI and who owns/controls what.

> From a performance point of view, I would expect the common case to be that
> when KVM gets an unshare request from the guest, it would be able to unmap
> those pages from the (cooperative) host userspace, and return back to the
> guest. In this scenario, the host userspace wouldn't even need to be
> involved.

Hard NAK, at least from an x86 perspective.  Userspace is the sole decision maker
with respect to what memory is state of shared vs. private, full stop.  The guest
can make *requests* to convert memory, but ultimately it's host userspace that
decides whether or not to honor the request.

We've litigated this exact issue multiple times.  All state changes must be
controlled by userspace, because userspace is the only entity that can gracefully
handle exceptions and edge cases, and is the only entity with (almost) full
knowledge of the system.  We can discuss this again if necessary, but I'd much
prefer to not rehash all of those conversations.

> Having a userspace IOCTL as part of this makes that trip unnecessarily longer
> for the common case.

I'm very skeptical that an exit to userspace is going to even be measurable in
terms of the cost to convert memory.  Conversion is going to require multiple
locks, modifications to multiple sets of page tables with all the associated TLB
maintenance, possibly cache maintenance, and probably a few other things I'm
forgetting.  The cost of a few user<=>kernel transitions is likely going to be a
drop in the bucket.

If I'm wrong, and there are flows where the user<=>kernel transitions are the
long pole, then we could certainly exploring adding a way for userspace to opt
into a "fast path" conversion.  But it would need to be exactly that, an optional
fast path that can fall back to the "slow" userspace-driven conversion as needed.