Message-ID: <CAGtprH8EMnmvvVir6_U+L5S3SEvrU1OzLrvkL58fXgfg59bjoA@mail.gmail.com>
Date: Fri, 16 May 2025 06:11:56 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "seanjc@...gle.com" <seanjc@...gle.com>, "pvorel@...e.cz" <pvorel@...e.cz>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "catalin.marinas@....com" <catalin.marinas@....com>,
"Miao, Jun" <jun.miao@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"pdurrant@...zon.co.uk" <pdurrant@...zon.co.uk>, "steven.price@....com" <steven.price@....com>,
"peterx@...hat.com" <peterx@...hat.com>, "x86@...nel.org" <x86@...nel.org>,
"amoorthy@...gle.com" <amoorthy@...gle.com>, "tabba@...gle.com" <tabba@...gle.com>,
"quic_svaddagi@...cinc.com" <quic_svaddagi@...cinc.com>, "maz@...nel.org" <maz@...nel.org>,
"vkuznets@...hat.com" <vkuznets@...hat.com>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
"keirf@...gle.com" <keirf@...gle.com>, "hughd@...gle.com" <hughd@...gle.com>,
"mail@...iej.szmigiero.name" <mail@...iej.szmigiero.name>, "palmer@...belt.com" <palmer@...belt.com>,
"Wieczor-Retman, Maciej" <maciej.wieczor-retman@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
"ajones@...tanamicro.com" <ajones@...tanamicro.com>, "willy@...radead.org" <willy@...radead.org>,
"jack@...e.cz" <jack@...e.cz>, "paul.walmsley@...ive.com" <paul.walmsley@...ive.com>, "aik@....com" <aik@....com>,
"usama.arif@...edance.com" <usama.arif@...edance.com>,
"quic_mnalajal@...cinc.com" <quic_mnalajal@...cinc.com>, "fvdl@...gle.com" <fvdl@...gle.com>,
"rppt@...nel.org" <rppt@...nel.org>, "quic_cvanscha@...cinc.com" <quic_cvanscha@...cinc.com>,
"nsaenz@...zon.es" <nsaenz@...zon.es>, "vbabka@...e.cz" <vbabka@...e.cz>, "Du, Fan" <fan.du@...el.com>,
"anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "mic@...ikod.net" <mic@...ikod.net>,
"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "bfoster@...hat.com" <bfoster@...hat.com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
"Li, Zhiquan1" <zhiquan1.li@...el.com>, "rientjes@...gle.com" <rientjes@...gle.com>,
"mpe@...erman.id.au" <mpe@...erman.id.au>, "Aktas, Erdem" <erdemaktas@...gle.com>,
"david@...hat.com" <david@...hat.com>, "jgg@...pe.ca" <jgg@...pe.ca>,
"jhubbard@...dia.com" <jhubbard@...dia.com>, "Xu, Haibo1" <haibo1.xu@...el.com>,
"anup@...infault.org" <anup@...infault.org>, "Hansen, Dave" <dave.hansen@...el.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "jthoughton@...gle.com" <jthoughton@...gle.com>,
"Wang, Wei W" <wei.w.wang@...el.com>,
"steven.sistare@...cle.com" <steven.sistare@...cle.com>, "jarkko@...nel.org" <jarkko@...nel.org>,
"quic_pheragu@...cinc.com" <quic_pheragu@...cinc.com>, "chenhuacai@...nel.org" <chenhuacai@...nel.org>,
"Huang, Kai" <kai.huang@...el.com>, "shuah@...nel.org" <shuah@...nel.org>,
"dwmw@...zon.co.uk" <dwmw@...zon.co.uk>, "pankaj.gupta@....com" <pankaj.gupta@....com>,
"Peng, Chao P" <chao.p.peng@...el.com>, "nikunj@....com" <nikunj@....com>,
"Graf, Alexander" <graf@...zon.com>, "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "yuzenghui@...wei.com" <yuzenghui@...wei.com>,
"jroedel@...e.de" <jroedel@...e.de>, "suzuki.poulose@....com" <suzuki.poulose@....com>,
"jgowans@...zon.com" <jgowans@...zon.com>, "Xu, Yilun" <yilun.xu@...el.com>,
"liam.merwick@...cle.com" <liam.merwick@...cle.com>, "michael.roth@....com" <michael.roth@....com>,
"quic_tsoni@...cinc.com" <quic_tsoni@...cinc.com>,
"richard.weiyang@...il.com" <richard.weiyang@...il.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"aou@...s.berkeley.edu" <aou@...s.berkeley.edu>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"qperret@...gle.com" <qperret@...gle.com>,
"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>, "dmatlack@...gle.com" <dmatlack@...gle.com>,
"james.morse@....com" <james.morse@....com>, "brauner@...nel.org" <brauner@...nel.org>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>, "pgonda@...gle.com" <pgonda@...gle.com>,
"quic_pderrin@...cinc.com" <quic_pderrin@...cinc.com>, "roypat@...zon.co.uk" <roypat@...zon.co.uk>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "will@...nel.org" <will@...nel.org>,
"hch@...radead.org" <hch@...radead.org>
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P
<rick.p.edgecombe@...el.com> wrote:
>
> On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > > > Thinking from the TDX perspective, we might have bigger fish to fry than
> > > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the
> > > > > benefits don't have numbers. How much are we getting for all the
> > > > > complexity, over say buddy allocated 2MB pages?
> >
> > TDX may have bigger fish to fry, but some of us have bigger fish to fry than
> > TDX :-)
>
> Fair enough. But TDX is on the "roadmap". So it helps to say what the target of
> this series is.
>
> >
> > > > This series should work for any page sizes backed by hugetlb memory.
> > > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > > > essential for certain workloads and will emerge as guest_memfd users.
> > > > Features like KHO/memory persistence in addition also depend on
> > > > hugepage support in guest_memfd.
> > > >
> > > > This series takes strides towards making guest_memfd compatible with
> > > > usecases where 1G pages are essential and non-confidential VMs are
> > > > already exercising them.
> > > >
> > > > I think the main complexity here lies in supporting in-place
> > > > conversion which applies to any huge page size even for buddy
> > > > allocated 2MB pages or THP.
> > > >
> > > > This complexity arises because page structs work at a fixed
> > > > granularity, future roadmap towards not having page structs for guest
> > > > memory (at least private memory to begin with) should help towards
> > > > greatly reducing this complexity.
> > > >
> > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > > essential and complement this series well for better memory footprint
> > > > and overall performance of TDX VMs.
> > >
> > > Hmm, this didn't really answer my questions about the concrete benefits.
> > >
> > > I think it would help to include this kind of justification for the 1GB
> > > guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> > > hard to review against...
> > >
> > > I think one of the challenges with coco is that it's almost like a sprint to
> > > reimplement virtualization. But enough things are changing at once that not
> > > all of the normal assumptions hold, so it can't copy all the same solutions.
> > > The recent example was that for TDX huge pages we found that normal
> > > promotion paths weren't actually yielding any benefit for surprising TDX
> > > specific reasons.
> > >
> > > On the TDX side we are also, at least currently, unmapping private pages
> > > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > > there are any shared pages in them. I wonder how many 1GB pages there would
> > > be after all the shared pages are converted. At smaller TD sizes, it could
> > > be not much.
> >
> > You're conflating two different things. guest_memfd allocating and managing
> > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > granularity. Allocating memory in 1GiB chunks is useful even if KVM can only
> > map memory into the guest using 4KiB pages.
>
> I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
> list quoted there was more about guest performance. Or maybe the clever page
> table walkers that find contiguous small mappings could benefit guest
> performance too? It's the kind of thing I'd like to see at least broadly called
> out.

The crux of this series is really hugetlb backing support for
guest_memfd and in-place conversion for CoCo VMs, irrespective of the
page size; as I suggested earlier, even 2M page sizes will need to
handle similar in-place conversion complexity.

Google internally uses 1G hugetlb pages to achieve high-bandwidth IO,
a lower memory footprint via HVO (HugeTLB Vmemmap Optimization), and a
smaller MMU/IOMMU page table memory footprint, among other
improvements. Even single-digit percentage savings carry a substantial
impact at the scale of large fleets of hosts, each with significant
memory capacity.
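
For context on where the ~1.6% figure quoted earlier in the thread
comes from, here is a back-of-the-envelope sketch only (assuming a
64-byte struct page and 4 KiB base pages):

  struct page overhead        = 64 B / 4096 B  ~= 1.6% of all memory
  vmemmap per 1 GiB hugepage  = 262144 * 64 B  =  16 MiB
  with HVO                    ~= a few KiB per 1 GiB page
                                 (nearly all vmemmap pages freed)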

guest_memfd hugepage support combined with hugepage EPT mapping
support for TDX VMs helps significantly:
1) ~70% decrease in TDX VM boot-up time
2) ~65% decrease in TDX VM shutdown time
3) ~90% decrease in TDX VM PAMT memory overhead
4) Improvement in TDX SEPT memory overhead
And we believe this combination should also help achieve better
performance with TDX Connect in the future.

Hugetlb pages are preferred because they are statically carved out at
boot and so provide much stronger availability guarantees; once the
pages are carved out, any VM scheduled on such a host has to work with
the same hugetlb page sizes. This series attempts to use hugetlb pages
with in-place conversion, avoiding the double-allocation problem that
otherwise results in significant memory overhead for CoCo VMs.
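
To make the allocation side concrete, a minimal userspace sketch is
below. KVM_CREATE_GUEST_MEMFD and struct kvm_create_guest_memfd are
the existing uAPI; the two hugetlb flags are placeholders standing in
for whatever encoding this series settles on, and the 1G pool itself
is reserved up front on the kernel command line (e.g. hugepagesz=1G
hugepages=N):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Placeholder flag values, not the final uAPI of this series. */
  #define GUEST_MEMFD_FLAG_HUGETLB   (1ULL << 1)
  #define GUEST_MEMFD_FLAG_HUGE_1GB  (1ULL << 2)

  /* Create a guest_memfd backed by 1G hugetlb pages. */
  static int create_1g_guest_memfd(int vm_fd, uint64_t size)
  {
          struct kvm_create_guest_memfd gmem = {
                  .size  = size,  /* multiple of 1 GiB */
                  .flags = GUEST_MEMFD_FLAG_HUGETLB |
                           GUEST_MEMFD_FLAG_HUGE_1GB,
          };

          /* Returns a new guest_memfd fd on success. */
          return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
  }

The point being that private and shared memory live in the same fd, so
a private->shared conversion can happen in place on the same 1G folio
instead of allocating a second copy from a different backing store.
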
>
> I'm thinking that Google must have a ridiculous amount of learnings about VM
> memory management. And this is probably designed around those learnings. But
> reviewers can't really evaluate it if they don't know the reasons and tradeoffs
> taken. If it's going upstream, I think it should have at least the high level
> reasoning explained.
>
> I don't mean to harp on the point so hard, but I didn't expect it to be
> controversial either.
>
> >
> > > So for TDX in isolation, it seems like jumping out too far ahead to
> > > effectively consider the value. But presumably you guys are testing this on
> > > SEV or something? Have you measured any performance improvement? For what
> > > kind of applications? Or is the idea to basically to make guestmemfd work
> > > like however Google does guest memory?
> >
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs, hence Vishal's "Non-CoCo VMs" comment.
>
> Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
> talking about pKVM.
>
> > Yes, some of this is useful for TDX, but we (and others) want to use
> > guest_memfd for far more than just CoCo VMs.
>
>
> > And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> I've heard this a lot. It must be true, but I've never seen the actual numbers.
> For a long time people believed 1GB huge pages on the direct map were critical,
> but then benchmarking on a contemporary CPU couldn't find much difference
> between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> the combined walks are huge, iTLB, etc, but I'd love to see a real number.