Message-ID: <7d3b391f3a31396bd9abe641259392fd94b5e72f.camel@intel.com>
Date: Fri, 16 May 2025 02:12:00 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "seanjc@...gle.com" <seanjc@...gle.com>
CC: "pvorel@...e.cz" <pvorel@...e.cz>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "catalin.marinas@....com" <catalin.marinas@....com>,
"Miao, Jun" <jun.miao@...el.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "pdurrant@...zon.co.uk" <pdurrant@...zon.co.uk>,
"steven.price@....com" <steven.price@....com>, "peterx@...hat.com"
<peterx@...hat.com>, "x86@...nel.org" <x86@...nel.org>, "amoorthy@...gle.com"
<amoorthy@...gle.com>, "tabba@...gle.com" <tabba@...gle.com>,
"quic_svaddagi@...cinc.com" <quic_svaddagi@...cinc.com>, "maz@...nel.org"
<maz@...nel.org>, "vkuznets@...hat.com" <vkuznets@...hat.com>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "keirf@...gle.com"
<keirf@...gle.com>, "hughd@...gle.com" <hughd@...gle.com>, "Annapurve,
Vishal" <vannapurve@...gle.com>, "mail@...iej.szmigiero.name"
<mail@...iej.szmigiero.name>, "palmer@...belt.com" <palmer@...belt.com>,
"Wieczor-Retman, Maciej" <maciej.wieczor-retman@...el.com>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>, "ajones@...tanamicro.com" <ajones@...tanamicro.com>,
"willy@...radead.org" <willy@...radead.org>, "jack@...e.cz" <jack@...e.cz>,
"paul.walmsley@...ive.com" <paul.walmsley@...ive.com>, "aik@....com"
<aik@....com>, "usama.arif@...edance.com" <usama.arif@...edance.com>,
"quic_mnalajal@...cinc.com" <quic_mnalajal@...cinc.com>, "fvdl@...gle.com"
<fvdl@...gle.com>, "rppt@...nel.org" <rppt@...nel.org>,
"quic_cvanscha@...cinc.com" <quic_cvanscha@...cinc.com>, "nsaenz@...zon.es"
<nsaenz@...zon.es>, "vbabka@...e.cz" <vbabka@...e.cz>, "Du, Fan"
<fan.du@...el.com>, "anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "mic@...ikod.net"
<mic@...ikod.net>, "oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "bfoster@...hat.com"
<bfoster@...hat.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
"Li, Zhiquan1" <zhiquan1.li@...el.com>, "rientjes@...gle.com"
<rientjes@...gle.com>, "mpe@...erman.id.au" <mpe@...erman.id.au>, "Aktas,
Erdem" <erdemaktas@...gle.com>, "david@...hat.com" <david@...hat.com>,
"jgg@...pe.ca" <jgg@...pe.ca>, "jhubbard@...dia.com" <jhubbard@...dia.com>,
"Xu, Haibo1" <haibo1.xu@...el.com>, "anup@...infault.org"
<anup@...infault.org>, "Hansen, Dave" <dave.hansen@...el.com>, "Yamahata,
Isaku" <isaku.yamahata@...el.com>, "jthoughton@...gle.com"
<jthoughton@...gle.com>, "Wang, Wei W" <wei.w.wang@...el.com>,
"steven.sistare@...cle.com" <steven.sistare@...cle.com>, "jarkko@...nel.org"
<jarkko@...nel.org>, "quic_pheragu@...cinc.com" <quic_pheragu@...cinc.com>,
"chenhuacai@...nel.org" <chenhuacai@...nel.org>, "Huang, Kai"
<kai.huang@...el.com>, "shuah@...nel.org" <shuah@...nel.org>,
"dwmw@...zon.co.uk" <dwmw@...zon.co.uk>, "pankaj.gupta@....com"
<pankaj.gupta@....com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"nikunj@....com" <nikunj@....com>, "Graf, Alexander" <graf@...zon.com>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "yuzenghui@...wei.com" <yuzenghui@...wei.com>,
"jroedel@...e.de" <jroedel@...e.de>, "suzuki.poulose@....com"
<suzuki.poulose@....com>, "jgowans@...zon.com" <jgowans@...zon.com>, "Xu,
Yilun" <yilun.xu@...el.com>, "liam.merwick@...cle.com"
<liam.merwick@...cle.com>, "michael.roth@....com" <michael.roth@....com>,
"quic_tsoni@...cinc.com" <quic_tsoni@...cinc.com>,
"richard.weiyang@...il.com" <richard.weiyang@...il.com>, "Weiny, Ira"
<ira.weiny@...el.com>, "aou@...s.berkeley.edu" <aou@...s.berkeley.edu>, "Li,
Xiaoyao" <xiaoyao.li@...el.com>, "qperret@...gle.com" <qperret@...gle.com>,
"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>,
"dmatlack@...gle.com" <dmatlack@...gle.com>, "james.morse@....com"
<james.morse@....com>, "brauner@...nel.org" <brauner@...nel.org>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"pgonda@...gle.com" <pgonda@...gle.com>, "quic_pderrin@...cinc.com"
<quic_pderrin@...cinc.com>, "roypat@...zon.co.uk" <roypat@...zon.co.uk>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "will@...nel.org"
<will@...nel.org>, "hch@...radead.org" <hch@...radead.org>
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > > Thinking from the TDX perspective, we might have bigger fish to fry than
> > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the
> > > > benefits don't have numbers. How much are we getting for all the
> > > > complexity, over say buddy allocated 2MB pages?
>
> TDX may have bigger fish to fry, but some of us have bigger fish to fry than
> TDX :-)
Fair enough. But TDX is on the "roadmap", so it would help to say what the
target of this series is.
>
> > > This series should work for any page size backed by hugetlb memory.
> > > Non-CoCo VMs, pKVM and Confidential VMs will all emerge as guest_memfd
> > > users, and all need hugepages, which are essential for certain
> > > workloads. Features like KHO/memory persistence also depend on
> > > hugepage support in guest_memfd.
> > >
> > > This series takes strides towards making guest_memfd compatible with
> > > usecases where 1G pages are essential and non-confidential VMs are
> > > already exercising them.
> > >
> > > I think the main complexity here lies in supporting in-place
> > > conversion, which applies to any huge page size, even buddy-allocated
> > > 2MB pages or THP.
> > >
> > > This complexity arises because page structs work at a fixed
> > > granularity; the future roadmap towards not having page structs for
> > > guest memory (at least private memory, to begin with) should greatly
> > > reduce this complexity.
> > >
> > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > essential, and they complement this series well by improving the
> > > memory footprint and overall performance of TDX VMs.
> >
> > Hmm, this didn't really answer my questions about the concrete benefits.
> >
> > I think it would help to include this kind of justification for the 1GB
> > guest_memfd pages. "essential for certain workloads and will emerge" is a
> > bit hard to review against...
> >
> > I think one of the challenges with CoCo is that it's almost like a
> > sprint to reimplement virtualization. But enough things are changing at
> > once that not all of the normal assumptions hold, so it can't copy all
> > the same solutions. A recent example was that for TDX huge pages we
> > found that the normal promotion paths weren't actually yielding any
> > benefit, for surprising TDX-specific reasons.
> >
> > On the TDX side we are also, at least currently, unmapping private pages
> > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > there are any shared pages in them. I wonder how many intact 1GB pages
> > would be left after all the shared pages are converted. At smaller TD
> > sizes, it might not be many.
>
> You're conflating two different things: guest_memfd allocating and managing
> 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> granularity. Allocating memory in 1GiB chunks is useful even if KVM can only
> map memory into the guest using 4KiB pages.
I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
list quoted there was more about guest performance. Or maybe the clever page
table walkers that find contiguous small mappings could benefit guest
performance too? It's the kind of thing I'd like to see at least broadly called
out.
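For reference, my back-of-the-envelope for where the ~1.6% comes from
(assuming a 64-byte struct page and 4KiB base pages):
    64 / 4096 = 1.5625% of memory spent on vmemmap
With hugetlb plus HVO, a 1GiB page's 16MiB of vmemmap (262144 struct pages *
64 bytes) can be freed except for a single 4KiB page, so nearly the full
1.5625% comes back. A 2MiB page gets 7 of its 8 vmemmap pages back, or
~1.37%. So 1GiB vs 2MiB is roughly a 0.2% delta, if I have the math right.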
I'm thinking that Google must have a ridiculous amount of learnings about VM
memory management, and this is probably designed around those learnings. But
reviewers can't really evaluate it if they don't know the reasons and the
tradeoffs that were made. If it's going upstream, I think it should have at
least the high-level reasoning explained.
I don't mean to harp on the point so hard, but I didn't expect it to be
controversial either.
>
> > So for TDX in isolation, it seems like jumping too far ahead to
> > effectively evaluate the value. But presumably you guys are testing this
> > on SEV or something? Have you measured any performance improvement? For
> > what kind of applications? Or is the idea basically to make guest_memfd
> > work however Google does guest memory?
>
> The longer term goal of guest_memfd is to make it suitable for backing all
> VMs, hence Vishal's "Non-CoCo VMs" comment.
Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
talking about pKVM.
> Yes, some of this is useful for TDX, but we (and others) want to use
> guest_memfd for far more than just CoCo VMs.
> And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
I've heard this a lot. It must be true, but I've never seen the actual numbers.
For a long time people believed 1GB huge pages on the direct map were critical,
but then benchmarking on a contemporary CPU couldn't find much difference
between 2MB and 1GB. I'd expect TDP huge pages to be different from that case
because the combined walks are long, there's iTLB pressure, etc., but I'd love
to see a real number.
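If I'm counting right, the back-of-the-envelope for those combined walks
(assuming 4-level tables on both dimensions) is that a worst-case TLB miss
under TDP can take
    (n+1)*(m+1)-1 = 5*5-1 = 24 memory accesses.
With 2MiB mappings on both dimensions that drops to 4*4-1 = 15, and with
1GiB on both it's 3*3-1 = 8. How much of the 15 -> 8 step survives the page
walk caches on contemporary CPUs is exactly the kind of number I'd love to
see.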