Message-ID: <20250712001055.3in2lnjz6zljydq2@amd.com>
Date: Fri, 11 Jul 2025 19:10:55 -0500
From: Michael Roth <michael.roth@....com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: Ackerley Tng <ackerleytng@...gle.com>, <kvm@...r.kernel.org>,
<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
<linux-fsdevel@...r.kernel.org>, <aik@....com>, <ajones@...tanamicro.com>,
<akpm@...ux-foundation.org>, <amoorthy@...gle.com>,
<anthony.yznaga@...cle.com>, <anup@...infault.org>, <aou@...s.berkeley.edu>,
<bfoster@...hat.com>, <binbin.wu@...ux.intel.com>, <brauner@...nel.org>,
<catalin.marinas@....com>, <chao.p.peng@...el.com>, <chenhuacai@...nel.org>,
<dave.hansen@...el.com>, <david@...hat.com>, <dmatlack@...gle.com>,
<dwmw@...zon.co.uk>, <erdemaktas@...gle.com>, <fan.du@...el.com>,
<fvdl@...gle.com>, <graf@...zon.com>, <haibo1.xu@...el.com>,
<hch@...radead.org>, <hughd@...gle.com>, <ira.weiny@...el.com>,
<isaku.yamahata@...el.com>, <jack@...e.cz>, <james.morse@....com>,
<jarkko@...nel.org>, <jgg@...pe.ca>, <jgowans@...zon.com>,
<jhubbard@...dia.com>, <jroedel@...e.de>, <jthoughton@...gle.com>,
<jun.miao@...el.com>, <kai.huang@...el.com>, <keirf@...gle.com>,
<kent.overstreet@...ux.dev>, <kirill.shutemov@...el.com>,
<liam.merwick@...cle.com>, <maciej.wieczor-retman@...el.com>,
<mail@...iej.szmigiero.name>, <maz@...nel.org>, <mic@...ikod.net>,
<mpe@...erman.id.au>, <muchun.song@...ux.dev>, <nikunj@....com>,
<nsaenz@...zon.es>, <oliver.upton@...ux.dev>, <palmer@...belt.com>,
<pankaj.gupta@....com>, <paul.walmsley@...ive.com>, <pbonzini@...hat.com>,
<pdurrant@...zon.co.uk>, <peterx@...hat.com>, <pgonda@...gle.com>,
<pvorel@...e.cz>, <qperret@...gle.com>, <quic_cvanscha@...cinc.com>,
<quic_eberman@...cinc.com>, <quic_mnalajal@...cinc.com>,
<quic_pderrin@...cinc.com>, <quic_pheragu@...cinc.com>,
<quic_svaddagi@...cinc.com>, <quic_tsoni@...cinc.com>,
<richard.weiyang@...il.com>, <rick.p.edgecombe@...el.com>,
<rientjes@...gle.com>, <roypat@...zon.co.uk>, <rppt@...nel.org>,
<seanjc@...gle.com>, <shuah@...nel.org>, <steven.price@....com>,
<steven.sistare@...cle.com>, <suzuki.poulose@....com>, <tabba@...gle.com>,
<thomas.lendacky@....com>, <usama.arif@...edance.com>, <vbabka@...e.cz>,
<viro@...iv.linux.org.uk>, <vkuznets@...hat.com>, <wei.w.wang@...el.com>,
<will@...nel.org>, <willy@...radead.org>, <xiaoyao.li@...el.com>,
<yan.y.zhao@...el.com>, <yilun.xu@...el.com>, <yuzenghui@...wei.com>,
<zhiquan1.li@...el.com>
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
shareability to guard faulting

On Mon, Jul 07, 2025 at 07:55:01AM -0700, Vishal Annapurve wrote:
> On Thu, Jul 3, 2025 at 1:41 PM Michael Roth <michael.roth@....com> wrote:
> > > > > > >
> > > > > > > Because shared pages are split once any memory is allocated, having a
> > > > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > > > do you think?
> > > > > > >
> > > > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > > > before converting.
> > > >
> > > > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > > > populate memory (in which case you can pretty much ignore my previous
> > > > response))
> > >
> > > I am assuming in-place conversion with huge page backing for the
> > > discussion below.
> > >
> > > Looks like there are three scenarios/usecases we are discussing here:
> > > 1) Pre-allocating guest_memfd file offsets
> > > - Userspace can use fallocate to do this for hugepages by keeping
> > > the file ranges marked private.
> > > 2) Prefaulting guest EPT/NPT entries
> > > 3) Populating initial guest payload into guest_memfd memory
> > > - Userspace can mark certain ranges as shared, populate the
> > > contents and convert the ranges back to private. So mmap will come in
> > > handy here.
> > >
> > > >
> > > > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > > > userspace already defaults to marking everything private pretty close to
> > > > guest_memfd creation time, so the potential for allocations to occur
> > > > in-between seems small, but worth confirming.
> > >
> > > Ok, I am not much worried about whether the INIT_PRIVATE flag gets
> > > supported or not, but more about the default setting that different
> > > CVMs start with. To me, it looks like all CVMs should start with
> > > everything private by default, and if there is a way to bake that
> > > configuration in at guest_memfd creation time, that would be good to
> > > have instead of doing "create and convert" operations; there is a
> > > fairly low cost to supporting this flag.
> > >
> > > >
> > > > But I know in the past there was a desire to ensure TDX/SNP could
> > > > support pre-allocating guest_memfd memory (and even pre-faulting via
> > > > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > > > fallocate() handling could still avoid the split if the whole hugepage
> > > > is private, though there is a bit more potential for that fallocate()
> > > > to happen before userspace does the "manually" shared->private
> > > > conversion. I'll double-check on that aspect, but otherwise, is there
> > > > still any other need for it?
> > >
> > > This usecase of being able to preallocate should still work with
> > > in-place conversion assuming all ranges are private before
> > > pre-population.
> >
> > Ok, I think I was missing that the merge logic here will then restore it
> > to 1GB before the guest starts, so the folio isn't permanently split if
> > we do the mmap() and that gives us more flexibility on how we can use
> > it.
> >
> > I was thinking we needed to avoid the split from the start by avoiding
> > paths like mmap() which might trigger the split. I was trying to avoid
> > any merge->unsplit logic in the THP case (or unsplit in general), in
> > which case we'd get permanent splits via the mmap() approach, but for
> > 2MB that's probably not a big deal.
>
> After initial payload population, during its runtime the guest can cause
> different hugepages to get split, which can remain split even after the
> guest converts them back to private. For THP there may not be much
> benefit in merging those pages together, especially if NPT/EPT entries
> can't be promoted back to hugepage mappings and there is no memory
> penalty since THP doesn't use HVO.
>
> Wishful thinking on my part: It would be great to figure out a way to
> promote these pagetable entries without relying on the guest, if
> possible with ABI updates, as I think the host should have some
> control over EPT/NPT granularities even for Confidential VMs. Along

I'm not sure how much it would buy us. For example, for a 2MB hugetlb
SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
split, but only about 30 or so of those get merged back to 2MB folios
during guest run-time. These are presumably the set of 2MB regions we
could promote back up, but it's not much given that we wouldn't expect
that value to grow proportionally for larger guests: it's really
separate things like the number of vCPUs (for shared GHCB pages), number
of virtio buffers, etc. that end up determining the upper bound on how
many pages might get split due to 4K private->shared conversion, and
these wouldn't vary all that much from guest to guest outside maybe vCPU
count.

For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get merged
during run-time and would be candidates for promotion.

This could be greatly improved from the guest side by using
higher-order allocations to create pools of shared memory that could
then be used to reduce the number of splits caused by doing
private->shared conversions on random ranges of malloc'd memory,
and this could be done even without special promotion support on the
host for pretty much the entirety of guest memory. The idea there would
be to just make optimized guests avoid the splits completely, rather
than relying on the limited subset that hardware can optimize without
guest cooperation.
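
As a rough illustration of what I mean (guest-side only, nothing from
this series; the pool size, names, and genalloc usage are just one way
it could be done, and error unwinding is omitted): do a single
2MB-aligned private->shared conversion up front via set_memory_decrypted()
and then carve shared buffers out of that pool, so later driver
allocations don't trigger new 4K conversions/splits on the host:

  #include <linux/init.h>
  #include <linux/gfp.h>
  #include <linux/mm.h>
  #include <linux/genalloc.h>
  #include <linux/set_memory.h>

  #define SHARED_POOL_ORDER  9   /* one 2MB unit, stays intact on the host */

  static struct gen_pool *shared_pool;

  static int __init shared_pool_init(void)
  {
      struct page *pages = alloc_pages(GFP_KERNEL, SHARED_POOL_ORDER);
      unsigned long va;
      int ret;

      if (!pages)
          return -ENOMEM;
      va = (unsigned long)page_address(pages);

      /* one 2MB-aligned private->shared conversion instead of many 4K ones */
      ret = set_memory_decrypted(va, 1 << SHARED_POOL_ORDER);
      if (ret)
          return ret;

      shared_pool = gen_pool_create(PAGE_SHIFT, -1);
      if (!shared_pool)
          return -ENOMEM;
      return gen_pool_add(shared_pool, va, PAGE_SIZE << SHARED_POOL_ORDER, -1);
  }

  /* drivers then grab already-shared memory instead of converting their own */
  static void *shared_buf_alloc(size_t size)
  {
      return (void *)gen_pool_alloc(shared_pool, size);
  }

The SWIOTLB pool already works roughly this way for DMA in SEV/TDX
guests; the point would be extending that pattern to the other shared
allocations guests make during run-time.
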
> similar lines, it would be great to have "page struct"-less memory
> working for Confidential VMs, which should greatly reduce the toil
> with merge/split operations and will render the conversions mostly to
> be pagetable manipulations.

FWIW, I did some profiling of split/merge vs. overall conversion time
(by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
and while split/merge does take quite a few more cycles than your
average conversion operation (~100x more), the total cycles spent
splitting/merging ended up being about 7% of the total cycles spent
handling conversions (1043938460 cycles in this case).

For 1GB, a split/merge takes >1000x more cycles than a normal conversion
operation (46475980 cycles vs 320 in this sample), but it's probably
still not too bad vs. the overall conversion path, and as mentioned above
it only happens about 6 times for a 16GB SNP guest, so I don't think
split/merge overhead is a huge deal for current guests, especially if we
work toward optimizing guest-side usage of shared memory in the future.
(There is potential for this to crater performance for a very
poorly-optimized guest, but I think the guest should bear some of the
burden for that sort of thing: e.g. flipping the same page back and forth
between shared/private vs. caching it for continued use as a shared page
in the guest driver path isn't something we should put too much effort
into optimizing.)
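
(For reference, the numbers above came from ad-hoc instrumentation along
the lines of the sketch below; the accumulators and the TIME_CYCLES()
wrapper are made-up names for illustration, with the real accounting
hung off the split/merge helpers and the kvm_gmem_convert_execute_work()
path:)

  #include <linux/timex.h>   /* get_cycles() */
  #include <linux/types.h>

  static u64 convert_cycles;     /* total cycles spent on conversion work */
  static u64 split_merge_cycles; /* subset spent splitting/merging folios */

  /* wrap a call and charge its cost to the given accumulator */
  #define TIME_CYCLES(accum, call)              \
  ({                                            \
      u64 __t0 = get_cycles();                  \
      typeof(call) __ret = (call);              \
      (accum) += get_cycles() - __t0;           \
      __ret;                                    \
  })

  /* e.g.: ret = TIME_CYCLES(split_merge_cycles, split_folio(folio)); */
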
>
> That being said, memory split and merge seem to be relatively
> lightweight for THP (with no memory allocation/freeing) and reusing
> the memory files after reboot of the guest VM will require pages to be
> merged to start with a clean slate. One possible option is to always
> merge as early as possible; a second option is to invent a new UAPI to
> do it on demand.
>
> For 1G pages, even if we go with 1G -> 2M -> 4K split stages, page
> splits result in higher memory usage with HVO around and it becomes
> useful to merge them back as early as possible as the guest proceeds to
> convert subranges of different hugepages over its lifetime. Merging
> pages as early as possible also allows reusing of memory files during
> the next reboot without having to invent a new UAPI.
>
> Caveats with "merge as early as possible":
> - Shared to private conversions will be slower for hugetlb pages.
> * Counter argument: These conversions are already slow as we need
> refcounts on the ranges getting converted to reach safe values.
> - If guests convert a particular range often then extra merge/split
> operations will result in overhead.
> * Counter argument: Since conversions are anyways slow, it's
> beneficial for guests to avoid such a scenario and keep back-and-forth
> conversions as infrequent as possible.

Fair enough. I'm not seeing any major reason not to do things this way,
as the overhead doesn't seem to be very significant for the common case.
(Even though, as noted above, the number of hugetlb pages we actually end
up merging at guest run-time seems to be fairly small, maybe there are
scenarios where this will have a bigger impact, and it certainly helps to
have it there for the pre-boot merges.)

-Mike