Open Source and information security mailing list archives
 
Message-ID: <CA+EHjTxjt-mb_WbtVymaBvCb1EdJAVMV_uGb4xDs_ewg4k0C4g@mail.gmail.com>
Date: Thu, 22 May 2025 16:07:29 +0100
From: Fuad Tabba <tabba@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, x86@...nel.org, 
	linux-fsdevel@...r.kernel.org, aik@....com, ajones@...tanamicro.com, 
	akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com, 
	anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com, 
	binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com, 
	chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com, 
	david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk, 
	erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, 
	haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com, 
	jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com, 
	jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com, 
	kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev, 
	kirill.shutemov@...el.com, liam.merwick@...cle.com, 
	maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org, 
	mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au, 
	muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es, 
	oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com, 
	paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk, 
	peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com, 
	quic_cvanscha@...cinc.com, quic_eberman@...cinc.com, 
	quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com, 
	quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com, 
	rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk, 
	rppt@...nel.org, shuah@...nel.org, steven.price@....com, 
	steven.sistare@...cle.com, suzuki.poulose@....com, thomas.lendacky@....com, 
	usama.arif@...edance.com, vbabka@...e.cz, viro@...iv.linux.org.uk, 
	vkuznets@...hat.com, wei.w.wang@...el.com, will@...nel.org, 
	willy@...radead.org, xiaoyao.li@...el.com, yan.y.zhao@...el.com, 
	yilun.xu@...el.com, yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
 KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls

Hi Sean,

On Thu, 22 May 2025 at 15:52, Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Wed, May 21, 2025, Fuad Tabba wrote:
> > On Wed, 21 May 2025 at 16:51, Vishal Annapurve <vannapurve@...gle.com> wrote:
> > > On Wed, May 21, 2025 at 8:22 AM Fuad Tabba <tabba@...gle.com> wrote:
> > > > On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@...gle.com> wrote:
> > > > > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@...gle.com> wrote:
> > > > > There are a bunch of complexities here: the reboot sequence on x86 can
> > > > > be triggered in multiple ways that I don't fully understand, but a few
> > > > > of them involve reading/writing a "reset register" in MMIO/PCI config
> > > > > space that is emulated directly by the host userspace. The host has to
> > > > > know when the guest is shutting down to manage its lifecycle.
> > > >
> > > > In that case, I think we need to fully understand these complexities
> > > > before adding new IOCTLs. It could be that once we understand these
> > > > issues, we find that we don't need these IOCTLs. It's hard to justify
> > > > adding an IOCTL for something we don't understand.
> > > >
> > >
> > > I don't understand all the ways an x86 guest can trigger a reboot, but I
> > > do know that the x86 CoCo Linux guest kernel triggers a reset using an
> > > MMIO/PCI config register write that is emulated by host userspace.
> > >
> > > > > x86 CoCo VM firmware doesn't support warm/soft reboot, and even if it
> > > > > does in the future, the guest kernel can choose a different reboot
> > > > > mechanism. So guest reboot needs to be emulated by always starting
> > > > > from scratch. This sequence requires the initial guest firmware
> > > > > payload to be installed into the private ranges of guest_memfd.
> > > > >
> > > > > >
> > > > > > Either the host doesn't (or cannot even) know that the guest is
> > > > > > rebooting, in which case I don't see how having an IOCTL would help.
> > > > >
> > > > > Host does know that the guest is rebooting.
> > > >
> > > > In that case, that (i.e., the host finding out that the guest is
> > > > rebooting) could trigger the conversion back to private. No need for an
> > > > IOCTL.
> > >
> > > In the reboot scenarios, it's the host userspace finding out that the guest
> > > kernel wants to reboot.
> >
> > How does the host userspace find that out? If the host userspace is capable
> > of finding that out, then surely KVM is also capable of finding out the same.
>
> Nope, not on x86.  Well, not without userspace invoking a new ioctl, which would
> defeat the purpose of adding these ioctls.
>
> KVM is only responsible for emulating/virtualizing the "CPU".  The chipset, e.g.
> the PCI config space, is fully owned by userspace.  KVM doesn't even know whether
> or not PCI exists for the VM.  And reboot may be emulated by simply creating a
> new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> change in state would happen in an entirely new struct kvm.
>
> That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> front.  The changelog asserts that:
>
>   A guest_memfd ioctl is used because shareability is a property of the memory,
>   and this property should be modifiable independently of the attached struct kvm
>
> but then follows with a very weak and IMO largely irrelevant justification of:
>
>   This allows shareability to be modified even if the memory is not yet bound
>   using memslots.
>
> Allowing userspace to change shareability without memslots is one relatively minor
> flow in one very specific use case.
>
> The real justification for these ioctls is that fundamentally, shareability for
> in-place conversions is a property of a guest_memfd instance and not a struct kvm
> instance, and so needs to be owned by guest_memfd.

Thanks for the clarification, Sean. I have a couple of follow-up
questions/comments that you might be able to help with:

From a conceptual point of view, I understand that in-place
conversion is a property of guest_memfd. But that doesn't necessarily
mean that the interface between KVM and guest_memfd has to be a
userspace IOCTL. The two already communicate directly, and other, even
less closely related subsystems within the kernel interact without
going through userspace. Why can't we do the same here? I'm not
suggesting that shareability not be owned by guest_memfd, only that we
communicate directly.

From a performance point of view, I would expect the common case to be
that when KVM gets an unshare request from the guest, it would be able
to unmap those pages from the (cooperative) host userspace and return
to the guest. In this scenario, the host userspace wouldn't even need
to be involved. Having a userspace IOCTL on this path makes the round
trip unnecessarily long for the common case.

Cheers,
/fuad

> I.e. focus on justifying the change from a design and conceptual perspective,
> not from a mechanical perspective of a flow that's likely somewhat unique to our
> specific environment.  Y'all are getting deep into the weeds on a random aspect
> of x86 platform architecture, instead of focusing on the overall design.
>
> The other issue that's likely making this more confusing than it needs to be is
> that this series is actually two completely different series bundled into one,
> with very little explanation.  Moving shared vs. private ownership into
> guest_memfd isn't a requirement for 1GiB support, it's a requirement for in-place
> shared/private conversion in guest_memfd.
>
> For the current guest_memfd implementation, shared vs. private is tracked in the
> VM via memory attributes, because a guest_memfd instance is *only* private.  I.e.
> shared vs. private is a property of the VM, not of the guest_memfd instance.  But
> when in-place conversion support comes along, ownership of that particular
> attribute needs to shift to the guest_memfd instance.
>
> I know I gave feedback on an earlier posting about there being too many series
> flying around, but shoving two distinct concepts into a single series is not the
> answer.  My complaint about too much noise wasn't that there were multiple
> series, it was that there was very little coordination and lots of chaos.
>
> If you split this series in two, which should be trivial since you've already
> organized the patches as a split, then sans the selftests (thank you for those!),
> in-place conversion support will be its own (much smaller!) series that can focus
> on that specific aspect of the design, and can provide a cover letter that
> expounds on the design goals and uAPI.
>
>   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
>   KVM: Query guest_memfd for private/shared status
>   KVM: guest_memfd: Skip LRU for guest_memfd folios
>   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
>   KVM: guest_memfd: Introduce and use shareability to guard faulting
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
>
> And then you can post the 1GiB series separately.  So long as you provide pointers
> to dependencies along with a link to a repo+branch with the kitchen sink, I won't
> complain about things being too chaotic :-)
