linux-kernel - Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting

Message-ID: <20250703203944.lhpyzu7elgqmplkl@amd.com>
Date: Thu, 3 Jul 2025 15:39:44 -0500
From: Michael Roth <michael.roth@....com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, <kvm@...r.kernel.org>,
	<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
	<linux-fsdevel@...r.kernel.org>, <aik@....com>, <ajones@...tanamicro.com>,
	<akpm@...ux-foundation.org>, <amoorthy@...gle.com>,
	<anthony.yznaga@...cle.com>, <anup@...infault.org>, <aou@...s.berkeley.edu>,
	<bfoster@...hat.com>, <binbin.wu@...ux.intel.com>, <brauner@...nel.org>,
	<catalin.marinas@....com>, <chao.p.peng@...el.com>, <chenhuacai@...nel.org>,
	<dave.hansen@...el.com>, <david@...hat.com>, <dmatlack@...gle.com>,
	<dwmw@...zon.co.uk>, <erdemaktas@...gle.com>, <fan.du@...el.com>,
	<fvdl@...gle.com>, <graf@...zon.com>, <haibo1.xu@...el.com>,
	<hch@...radead.org>, <hughd@...gle.com>, <ira.weiny@...el.com>,
	<isaku.yamahata@...el.com>, <jack@...e.cz>, <james.morse@....com>,
	<jarkko@...nel.org>, <jgg@...pe.ca>, <jgowans@...zon.com>,
	<jhubbard@...dia.com>, <jroedel@...e.de>, <jthoughton@...gle.com>,
	<jun.miao@...el.com>, <kai.huang@...el.com>, <keirf@...gle.com>,
	<kent.overstreet@...ux.dev>, <kirill.shutemov@...el.com>,
	<liam.merwick@...cle.com>, <maciej.wieczor-retman@...el.com>,
	<mail@...iej.szmigiero.name>, <maz@...nel.org>, <mic@...ikod.net>,
	<mpe@...erman.id.au>, <muchun.song@...ux.dev>, <nikunj@....com>,
	<nsaenz@...zon.es>, <oliver.upton@...ux.dev>, <palmer@...belt.com>,
	<pankaj.gupta@....com>, <paul.walmsley@...ive.com>, <pbonzini@...hat.com>,
	<pdurrant@...zon.co.uk>, <peterx@...hat.com>, <pgonda@...gle.com>,
	<pvorel@...e.cz>, <qperret@...gle.com>, <quic_cvanscha@...cinc.com>,
	<quic_eberman@...cinc.com>, <quic_mnalajal@...cinc.com>,
	<quic_pderrin@...cinc.com>, <quic_pheragu@...cinc.com>,
	<quic_svaddagi@...cinc.com>, <quic_tsoni@...cinc.com>,
	<richard.weiyang@...il.com>, <rick.p.edgecombe@...el.com>,
	<rientjes@...gle.com>, <roypat@...zon.co.uk>, <rppt@...nel.org>,
	<seanjc@...gle.com>, <shuah@...nel.org>, <steven.price@....com>,
	<steven.sistare@...cle.com>, <suzuki.poulose@....com>, <tabba@...gle.com>,
	<thomas.lendacky@....com>, <usama.arif@...edance.com>, <vbabka@...e.cz>,
	<viro@...iv.linux.org.uk>, <vkuznets@...hat.com>, <wei.w.wang@...el.com>,
	<will@...nel.org>, <willy@...radead.org>, <xiaoyao.li@...el.com>,
	<yan.y.zhao@...el.com>, <yilun.xu@...el.com>, <yuzenghui@...wei.com>,
	<zhiquan1.li@...el.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

On Wed, Jul 02, 2025 at 10:10:36PM -0700, Vishal Annapurve wrote:
> On Wed, Jul 2, 2025 at 9:12 PM Michael Roth <michael.roth@....com> wrote:
> >
> > On Wed, Jul 02, 2025 at 05:46:23PM -0700, Vishal Annapurve wrote:
> > > On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@....com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > > > > Michael Roth <michael.roth@....com> writes:
> > > > >
> > > > > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> > > > > >> Track guest_memfd memory's shareability status within the inode as
> > > > > > >> opposed to the file, since it is a property of the guest_memfd's memory
> > > > > >> contents.
> > > > > >>
> > > > > >> Shareability is a property of the memory and is indexed using the
> > > > > >> page's index in the inode. Because shareability is the memory's
> > > > > >> property, it is stored within guest_memfd instead of within KVM, like
> > > > > >> in kvm->mem_attr_array.
> > > > > >>
> > > > > >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> > > > > >> retained to allow VMs to only use guest_memfd for private memory and
> > > > > >> some other memory for shared memory.
> > > > > >>
> > > > > >> Not all use cases require guest_memfd() to be shared with the host
> > > > > >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> > > > > >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> > > > > >> private to the guest, and therefore not mappable by the
> > > > > >> host. Otherwise, memory is shared until explicitly converted to
> > > > > >> private.
> > > > > >>
> > > > > >> Signed-off-by: Ackerley Tng <ackerleytng@...gle.com>
> > > > > >> Co-developed-by: Vishal Annapurve <vannapurve@...gle.com>
> > > > > >> Signed-off-by: Vishal Annapurve <vannapurve@...gle.com>
> > > > > >> Co-developed-by: Fuad Tabba <tabba@...gle.com>
> > > > > >> Signed-off-by: Fuad Tabba <tabba@...gle.com>
> > > > > >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> > > > > >> ---
> > > > > >>  Documentation/virt/kvm/api.rst |   5 ++
> > > > > >>  include/uapi/linux/kvm.h       |   2 +
> > > > > >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> > > > > >>  3 files changed, 129 insertions(+), 2 deletions(-)
> > > > > >>
> > > > > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > > >> index 86f74ce7f12a..f609337ae1c2 100644
> > > > > >> --- a/Documentation/virt/kvm/api.rst
> > > > > >> +++ b/Documentation/virt/kvm/api.rst
> > > > > >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> > > > > >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> > > > > >>  This is validated when the guest_memfd instance is bound to the VM.
> > > > > >>
> > > > > >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> > > > > >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> > > > > >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> > > > > >> +by the host.
> > > > > >> +
> > > > > >
> > > > > > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > > > > > like this flag should be deferred until that patch is in place. Is it
> > > > > > really needed at that point though? Userspace would be able to set the
> > > > > > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> > > > > >
> > > > >
> > > > > I can move this change to the later patch. Thanks! Will fix in the next
> > > > > revision.
> > > > >
> > > > > > > The mtree contents seem to get stored in the same manner in either case so
> > > > > > performance-wise only the overhead of a few userspace<->kernel switches
> > > > > > would be saved. Are there any other reasons?
> > > > > >
> > > > > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > > > > least non-CoCo VMs would be able to reliably benefit) and let
> > > > > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > > > > granularity makes sense for the architecture/guest configuration.
> > > > > >
> > > > >
> > > > > Because shared pages are split once any memory is allocated, having a
> > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > do you think?
> > > > >
> > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > before converting.
> >
> > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > populate memory (in which case you can pretty much ignore my previous
> > response))
> 
> I am assuming in-place conversion with huge page backing for the
> discussion below.
> 
> Looks like there are three scenarios/usecases we are discussing here:
> 1) Pre-allocating guest_memfd file offsets
>    - Userspace can use fallocate to do this for hugepages by keeping
> the file ranges marked private.
> 2) Prefaulting guest EPT/NPT entries
> 3) Populating initial guest payload into guest_memfd memory
>    - Userspace can mark certain ranges as shared, populate the
> contents and convert the ranges back to private. So mmap will come in
> handy here.
> 
> >
> > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > userspace already defaults to marking everything private pretty close to
> > guest_memfd creation time, so the potential for allocations to occur
> > in-between seems small, but worth confirming.
> 
> Ok, I am not much worried about whether the INIT_PRIVATE flag gets
> supported or not, but more about the default setting that different
> CVMs start with. To me, it looks like all CVMs should start as
> everything private by default and if there is a way to bake that
> configuration during guest_memfd creation time that would be good to
> have instead of doing "create and convert" operations and there is a
> fairly low cost to support this flag.
> 
> >
> > But I know in the past there was a desire to ensure TDX/SNP could
> > support pre-allocating guest_memfd memory (and even pre-faulting via
> > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > fallocate() handling could still avoid the split if the whole hugepage
> > is private, though there is a bit more potential for that fallocate()
> > to happen before userspace does the "manual" shared->private
> > conversion. I'll double-check on that aspect, but otherwise, is there
> > still any other need for it?
> 
> This usecase of being able to preallocate should still work with
> in-place conversion assuming all ranges are private before
> pre-population.

Ok, I think I was missing that the merge logic here will then restore it
to 1GB before the guest starts, so the folio isn't permanently split if
we do the mmap() and that gives us more flexibility on how we can use
it.

I was thinking we needed to avoid the split from the start by avoiding
paths like mmap() which might trigger the split. I was trying to avoid
any merge->unsplit logic in the THP case (or unsplit in general), in
which case we'd get permanent splits via the mmap() approach, but for
2MB that's probably not a big deal.

> 
> >
> > > >
> > > > I assume we do want to support things like preallocating guest memory so
> > > > not sure this approach is feasible to avoid splits.
> > > >
> > > > But I feel like we might be working around a deeper issue here, which is
> > > > that we are pre-emptively splitting anything that *could* be mapped into
> > > > userspace (i.e. allocated+shared/mixed), rather than splitting when
> > > > necessary.
> > > >
> > > > I know that was the plan laid out in the guest_memfd calls, but I've run
> > > > into a couple instances that have me thinking we should revisit this.
> > > >
> > > > 1) Some of the recent guest_memfd discussion seems to be gravitating towards having
> > > >    userspace populate/initialize guest memory payload prior to boot via
> > > >    mmap()'ing the shared guest_memfd pages so things work the same as
> > > >    they would for initialized normal VM memory payload (rather than
> > > >    relying on back-channels in the kernel to copy user data into guest_memfd
> > > >    pages).
> > > >
> > > >    When you do this though, for an SNP guest at least, that memory
> > > >    acceptance is done in chunks of 4MB (with accept_memory=lazy), and
> > > >    because that will put each 1GB page into an allocated+mixed state,
> > >
> > > I would like your help in understanding why we need to start
> > > guest_memfd ranges as shared for SNP guests. guest_memfd ranges being
> > > private simply should mean that certain ranges are not faultable by
> > > the userspace.
> >
> > It's seeming like I probably misremembered, but I thought there was a
> > discussion on guest_memfd call a month (or so?) ago about whether to
> > continue to use backchannels to populate guest_memfd pages prior to
> > launch. It was in the context of whether to keep using kvm_gmem_populate()
> > for populating guest_memfd pages by copying them in from separate
> > userspace buffer vs. simply populating them directly from userspace.
> > I thought we were leaning on the latter since it was simpler all-around,
> > which is great for SNP since that is already how it populates memory: by
> > writing to it from userspace, which kvm_gmem_populate() then copies into
> > guest_memfd pages. With shared gmem support, we just skip the latter now
> > in the kernel rather than needing changes to how userspace handles things in
> > that regard. But maybe that was just wishful thinking :)
> 
> You remember it correctly and that's how userspace should pre-populate
> guest memory contents with in-place conversion support available.
> Userspace can simply do the following scheme as an example:
> 1) Create guest_memfd with the INIT_PRIVATE flag or if we decide to
> not go that way, create a guest_memfd file and set all ranges as
> private.
> 2) Preallocate the guest_memfd ranges.
> 3) Convert the needed ranges to shared, populate the initial guest
> payload and then convert those ranges back to private.
> 
> Important point here is that guest_memfd ranges can be marked as
> private before pre-allocating guest_memfd ranges.

Got it, and then the merge logic triggers so you get the 1GB back before
guest launch. That seems reasonable. I was only thinking of the merge
logic in the context of a running guest and it didn't seem all that useful
in that regard, but it makes perfect sense for the above sort of scenario.
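
To make the split/merge behaviour in that scheme concrete, here is a toy
model in Python. It is only a sketch under assumptions taken from this
thread: a hugepage is split while it is allocated and has at least one
shared page, and merged again once every page is private. ToyHugepage,
fallocate(), convert() and populate_payload() are hypothetical names for
illustration, not the guest_memfd implementation or its uAPI.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHugepage:
    """Toy stand-in for one huge folio backing a guest_memfd range."""
    npages: int = 8                             # tiny on purpose; a real 1GB page has 262144 4K pages
    shared: set = field(default_factory=set)    # indices currently shared; empty means all private
    allocated: bool = False
    split: bool = False
    log: list = field(default_factory=list)     # records "split"/"merge" transitions

    def _update(self):
        # Assumed rule from the thread: split when allocated and shared/mixed,
        # merge when everything is private again.
        want_split = self.allocated and bool(self.shared)
        if want_split and not self.split:
            self.split = True
            self.log.append("split")
        elif not want_split and self.split:
            self.split = False
            self.log.append("merge")

    def fallocate(self):
        # Models preallocating the range.
        self.allocated = True
        self._update()

    def convert(self, start, count, to_shared):
        # Models a shared<->private conversion; a no-op if already in that state.
        pages = set(range(start, start + count))
        if to_shared:
            self.shared |= pages
        else:
            self.shared -= pages
        self._update()


def populate_payload(page, start, count):
    """Steps 2 and 3 of the scheme, starting from an all-private page."""
    page.fallocate()                    # all private, so no split here
    page.convert(start, count, True)    # shared for population: split
    # ... userspace would write the initial payload via mmap() here ...
    page.convert(start, count, False)   # back to private: merge
    return page.log
```

In this model, starting all-private and populating two pages yields one
split followed by one merge, so the page is whole again before launch. A
page that starts fully shared is split as soon as it is allocated.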

Thanks,

Mike

> 
> >
> > But you raise some very compelling points on why this might not be a
> > good idea even if that was how that discussion went.
> >
> > >
> > > Will the following work?
> > > 1) Userspace starts all guest_memfd ranges as private.
> > > 2) During early guest boot it starts issuing PSC requests for
> > > converting memory from shared to private
> > >     -> KVM forwards this request to userspace
> > >     -> Userspace checks that the pages are already private and simply
> > > does nothing.
> > > 3) Pvalidate from guest on that memory will result in guest_memfd
> > > offset query which will cause the RMP table entries to actually get
> > > populated.
> >
> > That would work, but there will need to be changes on userspace to deal
> > with how SNP populates memory pre-boot just like normal VMs do. We will
> > instead need to copy that data into separate buffers, and pass those in
> > as the buffer hva instead of the shared hva corresponding to that GPA.
> 
> Initial guest memory payload generally carries a much smaller
> footprint so I ignored that detail in the above sequence. As I said
> above, userspace should be able to use guest_memfd ranges to directly
> populate contents by converting those ranges to shared.
> 
> >
> > But that seems reasonable if it avoids so many other problems.
> >
> > >
> > > >    we end up splitting every 1GB to 4K and the guest can't even
> > > >    accept/PVALIDATE it at 2MB at that point even if userspace doesn't touch
> > > >    anything in the range. At some point the guest will convert/accept
> > > >    the entire range, at which point we could merge, but for SNP we'd
> > > >    need guest cooperation to actually use a higher-granularity in stage2
> > > >    page tables at that point since RMP entries are effectively all split
> > > >    to 4K.
> > > >
> > > >    I understand the intent is to default to private where this wouldn't
> > > >    be an issue, and we could punt to userspace to deal with it, but it
> > > >    feels like an artificial restriction to place on userspace. And if we
> > > >    do want to allow/expect guest_memfd contents to be initialized pre-boot
> > > >    just like normal memory, then userspace would need to jump through
> > > >    some hoops:
> > > >
> > > >    - if defaulting to private: add hooks to convert each range that's being
> > > >      modified to a shared state prior to writing to it
> > >
> > > Why is that a problem?
> >
> > These were only problems if we went the above-mentioned way of
> > populating memory pre-boot via mmap() instead of other backchannels. If
> > > we don't do that, then both these things cease to be problems. Sounds good
> > to me. :)
> 
> I think there wouldn't be a problem even if we pre-populated memory
> pre-boot via mmap(). Using mmap() seems a preferable option to me.
>