Message-ID: <aC86OsU2HSFZkJP6@google.com>
Date: Thu, 22 May 2025 07:52:42 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Fuad Tabba <tabba@...gle.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, x86@...nel.org,
linux-fsdevel@...r.kernel.org, aik@....com, ajones@...tanamicro.com,
akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com,
anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com,
binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com,
chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com,
david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk,
erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com,
haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com,
jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com,
jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com,
kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev,
kirill.shutemov@...el.com, liam.merwick@...cle.com,
maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org,
mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au,
muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es,
oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com,
paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk,
peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com,
quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com,
rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk,
rppt@...nel.org, shuah@...nel.org, steven.price@....com,
steven.sistare@...cle.com, suzuki.poulose@....com, thomas.lendacky@....com,
usama.arif@...edance.com, vbabka@...e.cz, viro@...iv.linux.org.uk,
vkuznets@...hat.com, wei.w.wang@...el.com, will@...nel.org,
willy@...radead.org, xiaoyao.li@...el.com, yan.y.zhao@...el.com,
yilun.xu@...el.com, yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce
KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
On Wed, May 21, 2025, Fuad Tabba wrote:
> On Wed, 21 May 2025 at 16:51, Vishal Annapurve <vannapurve@...gle.com> wrote:
> > On Wed, May 21, 2025 at 8:22 AM Fuad Tabba <tabba@...gle.com> wrote:
> > > On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@...gle.com> wrote:
> > > > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@...gle.com> wrote:
> > > > There are a bunch of complexities here. The reboot sequence on x86 can
> > > > be triggered in multiple ways that I don't fully understand, but a few
> > > > of them include reading/writing a "reset register" in MMIO/PCI config
> > > > space that is emulated by the host userspace directly. The host has to
> > > > know when the guest is shutting down to manage its lifecycle.
> > >
> > > In that case, I think we need to fully understand these complexities
> > > before adding new IOCTLs. It could be that once we understand these
> > > issues, we find that we don't need these IOCTLs. It's hard to justify
> > > adding an IOCTL for something we don't understand.
> > >
> >
> > I don't understand all the ways an x86 guest can trigger a reboot, but I
> > do know that the x86 CoCo Linux guest kernel triggers reset using an
> > MMIO/PCI config register write that is emulated by host userspace.
> >
> > > > x86 CoCo VM firmwares don't support warm/soft reboot, and even if they
> > > > do in the future, the guest kernel can choose a different reboot
> > > > mechanism. So guest reboot needs to be emulated by always starting from
> > > > scratch. This sequence needs the initial guest firmware payload to be
> > > > installed into private ranges of guest_memfd.
> > > >
> > > > >
> > > > > Either the host doesn't (or cannot even) know that the guest is
> > > > > rebooting, in which case I don't see how having an IOCTL would help.
> > > >
> > > > Host does know that the guest is rebooting.
> > >
> > > In that case, that (i.e., the host finding out that the guest is
> > > rebooting) could trigger the conversion back to private. No need for an
> > > IOCTL.
> >
> > In the reboot scenarios, it's the host userspace finding out that the guest
> > kernel wants to reboot.
>
> How does the host userspace find that out? If the host userspace is capable
> of finding that out, then surely KVM is also capable of finding out the same.
Nope, not on x86. Well, not without userspace invoking a new ioctl, which would
defeat the purpose of adding these ioctls.
KVM is only responsible for emulating/virtualizing the "CPU". The chipset, e.g.
the PCI config space, is fully owned by userspace. KVM doesn't even know whether
or not PCI exists for the VM. And reboot may be emulated by simply creating a
new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
change in state would happen in an entirely new struct kvm.
That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
front. The changelog asserts that:
A guest_memfd ioctl is used because shareability is a property of the memory,
and this property should be modifiable independently of the attached struct kvm
but then follows with a very weak and IMO largely irrelevant justification of:
This allows shareability to be modified even if the memory is not yet bound
using memslots.
Allowing userspace to change shareability without memslots is one relatively minor
flow in one very specific use case.
The real justification for these ioctls is that fundamentally, shareability for
in-place conversions is a property of a guest_memfd instance and not a struct kvm
instance, and so needs to be owned by guest_memfd.
I.e. focus on justifying the change from a design and conceptual perspective,
not from a mechanical perspective of a flow that's likely somewhat unique to our
specific environment. Y'all are getting deep into the weeds on a random aspect
of x86 platform architecture, instead of focusing on the overall design.
The other issue that's likely making this more confusing than it needs to be is
that this series is actually two completely different series bundled into one,
with very little explanation. Moving shared vs. private ownership into
guest_memfd isn't a requirement for 1GiB support, it's a requirement for in-place
shared/private conversion in guest_memfd.
For the current guest_memfd implementation, shared vs. private is tracked in the
VM via memory attributes, because a guest_memfd instance is *only* private. I.e.
shared vs. private is a property of the VM, not of the guest_memfd instance. But
when in-place conversion support comes along, ownership of that particular
attribute needs to shift to the guest_memfd instance.
I know I gave feedback on an earlier posting about there being too many series
flying around, but shoving two distinct concepts into a single series is not the
answer. My complaints about too much noise weren't that there were multiple
series, it was that there was very little coordination and lots of chaos.
If you split this series in two, which should be trivial since you've already
organized the patches as a split, then sans the selftests (thank you for those!),
in-place conversion support will be its own (much smaller!) series that can focus
on that specific aspect of the design, and can provide a cover letter that
expounds on the design goals and uAPI.
KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
KVM: Query guest_memfd for private/shared status
KVM: guest_memfd: Skip LRU for guest_memfd folios
KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
KVM: guest_memfd: Introduce and use shareability to guard faulting
KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
And then you can post the 1GiB series separately. So long as you provide pointers
to dependencies along with a link to a repo+branch with the kitchen sink, I won't
complain about things being too chaotic :-)