Message-ID: <CAGtprH-fSW219J3gxD3UFLKhSvBj-kqUDezRXPFqTjj90po_xQ@mail.gmail.com>
Date: Sat, 12 Jul 2025 10:53:17 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Michael Roth <michael.roth@....com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, kvm@...r.kernel.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, x86@...nel.org, linux-fsdevel@...r.kernel.org, 
	aik@....com, ajones@...tanamicro.com, akpm@...ux-foundation.org, 
	amoorthy@...gle.com, anthony.yznaga@...cle.com, anup@...infault.org, 
	aou@...s.berkeley.edu, bfoster@...hat.com, binbin.wu@...ux.intel.com, 
	brauner@...nel.org, catalin.marinas@....com, chao.p.peng@...el.com, 
	chenhuacai@...nel.org, dave.hansen@...el.com, david@...hat.com, 
	dmatlack@...gle.com, dwmw@...zon.co.uk, erdemaktas@...gle.com, 
	fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, haibo1.xu@...el.com, 
	hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com, 
	jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com, 
	jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com, 
	kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev, 
	kirill.shutemov@...el.com, liam.merwick@...cle.com, 
	maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org, 
	mic@...ikod.net, mpe@...erman.id.au, muchun.song@...ux.dev, nikunj@....com, 
	nsaenz@...zon.es, oliver.upton@...ux.dev, palmer@...belt.com, 
	pankaj.gupta@....com, paul.walmsley@...ive.com, pbonzini@...hat.com, 
	pdurrant@...zon.co.uk, peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, 
	qperret@...gle.com, quic_cvanscha@...cinc.com, quic_eberman@...cinc.com, 
	quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com, 
	quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com, 
	rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk, 
	rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com, 
	steven.sistare@...cle.com, suzuki.poulose@....com, tabba@...gle.com, 
	thomas.lendacky@....com, usama.arif@...edance.com, vbabka@...e.cz, 
	viro@...iv.linux.org.uk, vkuznets@...hat.com, wei.w.wang@...el.com, 
	will@...nel.org, willy@...radead.org, xiaoyao.li@...el.com, 
	yan.y.zhao@...el.com, yilun.xu@...el.com, yuzenghui@...wei.com, 
	zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use
 shareability to guard faulting

On Fri, Jul 11, 2025 at 5:11 PM Michael Roth <michael.roth@....com> wrote:
> >
> > Wishful thinking on my part: It would be great to figure out a way to
> > promote these pagetable entries without relying on the guest, if
> > possible with ABI updates, as I think the host should have some
> > control over EPT/NPT granularities even for Confidential VMs. Along
>
> I'm not sure how much it would buy us. For example, for a 2MB hugetlb
> SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
> split, but only about 30 or so of those get merged back to 2MB folios
> during guest run-time. These are presumably the set of 2MB regions we
> could promote back up, but it's not much given that we wouldn't expect
> that value to grow proportionally for larger guests: it's really
> separate things like the number of vCPUs (for shared GHCB pages), number
> of virtio buffers, etc. that end up determining the upper bound on how
> many pages might get split due to 4K private->shared conversion, and
> these wouldn't vary all that much from guest to guest outside of
> maybe vCPU count.
>
> For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get merged
> during run-time and would be candidates for promotion.
>

Thanks for the great analysis here. I think we will need to repeat
such analysis for other scenarios such as usage with accelerators.

> This could be greatly improved from the guest side by using
> higher-order allocations to create pools of shared memory that could
> then be used to reduce the number of splits caused by doing
> private->shared conversions on random ranges of malloc'd memory,
> and this could be done even without special promotion support on the
> host for pretty much the entirety of guest memory. The idea there would
> be to just make optimized guests avoid the splits completely, rather
> than relying on the limited subset that hardware can optimize without
> guest cooperation.

Yes, it would be great to improve the situation from the guest side.
For example, I tried a rough draft of this [1]; the conclusion there
was that we need to set aside "enough" guest memory as CMA so that all
DMA goes through 2M-aligned buffers. It's hard to figure out how much
is "enough", but we could start somewhere. That being said, the host
still has to manage memory by splitting/merging at runtime, because I
don't think it's possible to enforce that all conversions happen at 2M
(let alone 1G) granularity. So even if guests do a significant chunk
of conversions at hugepage granularity, the host very likely still
needs to split pages all the way down to 4K for all shared regions,
unless we bake another restriction into the conversion ABI: that
guests may only convert back to private the same ranges they
previously converted to shared.

[1] https://lore.kernel.org/lkml/20240112055251.36101-1-vannapurve@google.com/
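
To make the idea above a bit more concrete, here is a minimal sketch
of what a guest-side pre-converted shared pool could look like. All of
the names (snp_shared_pool_*, snp_shared_alloc/free) are made up for
illustration and none of this is existing code; it just carves out one
naturally aligned 2MB block, converts it to shared once via
set_memory_decrypted(), and hands out sub-allocations from a genpool
so drivers never have to convert random 4K pages:

#include <linux/genalloc.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/numa.h>
#include <linux/set_memory.h>
#include <linux/sizes.h>

static struct gen_pool *snp_shared_pool;

static int __init snp_shared_pool_init(void)
{
	struct page *pages;
	unsigned long vaddr;
	int ret;

	/*
	 * One naturally aligned 2MB block; a real implementation would
	 * size this based on expected shared-memory demand (number of
	 * vCPUs, virtio queues, ...).
	 */
	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(SZ_2M));
	if (!pages)
		return -ENOMEM;
	vaddr = (unsigned long)page_address(pages);

	/*
	 * Convert the whole block to shared up front, so users of the
	 * pool never trigger 4K private->shared splits on the host.
	 */
	ret = set_memory_decrypted(vaddr, SZ_2M >> PAGE_SHIFT);
	if (ret) {
		__free_pages(pages, get_order(SZ_2M));
		return ret;
	}

	snp_shared_pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
	if (!snp_shared_pool)
		return -ENOMEM;
	return gen_pool_add(snp_shared_pool, vaddr, SZ_2M, NUMA_NO_NODE);
}

/*
 * Drivers allocate shared buffers here instead of converting
 * arbitrary 4K pages of malloc'd memory.
 */
void *snp_shared_alloc(size_t size)
{
	unsigned long addr = gen_pool_alloc(snp_shared_pool, size);

	return addr ? (void *)addr : NULL;
}

void snp_shared_free(void *addr, size_t size)
{
	gen_pool_free(snp_shared_pool, (unsigned long)addr, size);
}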

>
> > the similar lines, it would be great to have "page struct"-less memory
> > working for Confidential VMs, which should greatly reduce the toil
> > with merge/split operations and will render the conversions mostly to
> > be pagetable manipulations.
>
> FWIW, I did some profiling of split/merge vs. overall conversion time
> (by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
> and while split/merge does take quite a few more cycles than your
> average conversion operation (~100x more), the total cycles spent
> splitting/merging ended up being about 7% of the total cycles spent
> handling conversions (1043938460 cycles in this case).
>
> For 1GB, a split/merge takes >1000x more than a normal conversion
> operation (46475980 cycles vs 320 in this sample), but it's probably
> still not too bad vs the overall conversion path, and as mentioned above
> it only happens about 6x for a 16GB SNP guest, so I don't think split/merge
> overhead is a huge deal for current guests, especially if we work toward
> optimizing guest-side usage of shared memory in the future. (There is
> potential for this to crater performance for a very poorly-optimized
> guest however but I think the guest should bear some burden for that
> sort of thing: e.g. flipping the same page back-and-forth between
> shared/private vs. caching it for continued usage as a shared page in the
> guest driver path isn't something we should put too much effort into
> optimizing.)
>

As per past discussions, guest_memfd private pages are managed solely
by guest_memfd: we don't need, and effectively don't want, the rest of
the kernel managing guest private memory. So in theory we could get
rid of page structs for private pages entirely, allocating page
structs only for shared memory at conversion time and deallocating
them on conversion back to private.

And when we have base core-mm allocators that hand out raw pfns to
start with, we don't even need shared memory ranges to be backed by
page structs.

A few hurdles we need to cross:
1) Invent a new filemap equivalent that maps guest_memfd offsets to
pfns (see the rough sketch after this list)
2) Modify TDX EPT management to work with pfns and not page structs
3) Modify generic KVM NPT/EPT management logic to work with pfns and
not rely on page structs
4) Modify memory error/hwpoison handling to route all memory errors on
such pfns to guest_memfd.
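
For (1), here is a rough sketch of the shape such a filemap equivalent
could take, assuming we simply key an xarray by guest_memfd page
offset and store raw pfns as xarray values (gmem_pfn_map and friends
are hypothetical names, not existing guest_memfd code):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/xarray.h>

/*
 * Hypothetical pfn-based replacement for the filemap: maps
 * guest_memfd page offsets to raw pfns without touching struct page.
 */
struct gmem_pfn_map {
	struct xarray pfns;	/* index: page offset, entry: pfn */
};

static inline void gmem_pfn_map_init(struct gmem_pfn_map *m)
{
	xa_init(&m->pfns);
}

/*
 * Record the pfn backing a given offset, e.g. right after allocating
 * from a pfn-based allocator.
 */
static inline int gmem_pfn_map_store(struct gmem_pfn_map *m,
				     pgoff_t index, unsigned long pfn)
{
	return xa_err(xa_store(&m->pfns, index, xa_mk_value(pfn),
			       GFP_KERNEL));
}

/*
 * Look up the pfn backing a fault at @index; returns -ENOENT if the
 * range has no backing yet.
 */
static inline long gmem_pfn_map_lookup(struct gmem_pfn_map *m,
				       pgoff_t index)
{
	void *entry = xa_load(&m->pfns, index);

	if (!entry)
		return -ENOENT;
	return xa_to_value(entry);
}

The harder part is of course (2)-(4): teaching the KVM MMU and the
hwpoison paths to consume pfns like these directly instead of page
structs.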

I believe there are obvious benefits (reduced complexity, reduced
memory footprint, etc.) if we go this route, and we are very likely to
go this route for future use cases even if we decide to live with the
conversion costs today.
