Message-ID: <24e8ae7483d0fada8d5042f9cd5598573ca8f1c5.camel@intel.com>
Date: Thu, 15 May 2025 23:35:04 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Annapurve, Vishal" <vannapurve@...gle.com>
CC: "palmer@...belt.com" <palmer@...belt.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "catalin.marinas@....com" <catalin.marinas@....com>,
	"Miao, Jun" <jun.miao@...el.com>, "nsaenz@...zon.es" <nsaenz@...zon.es>,
	"pdurrant@...zon.co.uk" <pdurrant@...zon.co.uk>, "vbabka@...e.cz"
	<vbabka@...e.cz>, "peterx@...hat.com" <peterx@...hat.com>, "x86@...nel.org"
	<x86@...nel.org>, "tabba@...gle.com" <tabba@...gle.com>, "keirf@...gle.com"
	<keirf@...gle.com>, "quic_svaddagi@...cinc.com" <quic_svaddagi@...cinc.com>,
	"amoorthy@...gle.com" <amoorthy@...gle.com>, "pvorel@...e.cz"
	<pvorel@...e.cz>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
	"mail@...iej.szmigiero.name" <mail@...iej.szmigiero.name>,
	"vkuznets@...hat.com" <vkuznets@...hat.com>, "anthony.yznaga@...cle.com"
	<anthony.yznaga@...cle.com>, "Wang, Wei W" <wei.w.wang@...el.com>,
	"jack@...e.cz" <jack@...e.cz>, "Wieczor-Retman, Maciej"
	<maciej.wieczor-retman@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
	"Hansen, Dave" <dave.hansen@...el.com>, "ajones@...tanamicro.com"
	<ajones@...tanamicro.com>, "paul.walmsley@...ive.com"
	<paul.walmsley@...ive.com>, "quic_mnalajal@...cinc.com"
	<quic_mnalajal@...cinc.com>, "aik@....com" <aik@....com>,
	"usama.arif@...edance.com" <usama.arif@...edance.com>, "willy@...radead.org"
	<willy@...radead.org>, "rppt@...nel.org" <rppt@...nel.org>,
	"bfoster@...hat.com" <bfoster@...hat.com>, "quic_cvanscha@...cinc.com"
	<quic_cvanscha@...cinc.com>, "Du, Fan" <fan.du@...el.com>, "fvdl@...gle.com"
	<fvdl@...gle.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "mic@...ikod.net" <mic@...ikod.net>,
	"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"steven.price@....com" <steven.price@....com>, "muchun.song@...ux.dev"
	<muchun.song@...ux.dev>, "binbin.wu@...ux.intel.com"
	<binbin.wu@...ux.intel.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"rientjes@...gle.com" <rientjes@...gle.com>, "mpe@...erman.id.au"
	<mpe@...erman.id.au>, "Aktas, Erdem" <erdemaktas@...gle.com>,
	"david@...hat.com" <david@...hat.com>, "jgg@...pe.ca" <jgg@...pe.ca>,
	"hughd@...gle.com" <hughd@...gle.com>, "Xu, Haibo1" <haibo1.xu@...el.com>,
	"jhubbard@...dia.com" <jhubbard@...dia.com>, "anup@...infault.org"
	<anup@...infault.org>, "maz@...nel.org" <maz@...nel.org>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "jthoughton@...gle.com" <jthoughton@...gle.com>,
	"steven.sistare@...cle.com" <steven.sistare@...cle.com>, "jarkko@...nel.org"
	<jarkko@...nel.org>, "quic_pheragu@...cinc.com" <quic_pheragu@...cinc.com>,
	"Shutemov, Kirill" <kirill.shutemov@...el.com>, "chenhuacai@...nel.org"
	<chenhuacai@...nel.org>, "Huang, Kai" <kai.huang@...el.com>,
	"shuah@...nel.org" <shuah@...nel.org>, "dwmw@...zon.co.uk"
	<dwmw@...zon.co.uk>, "pankaj.gupta@....com" <pankaj.gupta@....com>, "Peng,
 Chao P" <chao.p.peng@...el.com>, "nikunj@....com" <nikunj@....com>, "Graf,
 Alexander" <graf@...zon.com>, "viro@...iv.linux.org.uk"
	<viro@...iv.linux.org.uk>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"yuzenghui@...wei.com" <yuzenghui@...wei.com>, "jroedel@...e.de"
	<jroedel@...e.de>, "suzuki.poulose@....com" <suzuki.poulose@....com>,
	"jgowans@...zon.com" <jgowans@...zon.com>, "Xu, Yilun" <yilun.xu@...el.com>,
	"liam.merwick@...cle.com" <liam.merwick@...cle.com>, "michael.roth@....com"
	<michael.roth@....com>, "quic_tsoni@...cinc.com" <quic_tsoni@...cinc.com>,
	"richard.weiyang@...il.com" <richard.weiyang@...il.com>, "Weiny, Ira"
	<ira.weiny@...el.com>, "aou@...s.berkeley.edu" <aou@...s.berkeley.edu>, "Li,
 Xiaoyao" <xiaoyao.li@...el.com>, "qperret@...gle.com" <qperret@...gle.com>,
	"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>,
	"dmatlack@...gle.com" <dmatlack@...gle.com>, "james.morse@....com"
	<james.morse@....com>, "brauner@...nel.org" <brauner@...nel.org>,
	"roypat@...zon.co.uk" <roypat@...zon.co.uk>, "ackerleytng@...gle.com"
	<ackerleytng@...gle.com>, "linux-fsdevel@...r.kernel.org"
	<linux-fsdevel@...r.kernel.org>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"quic_pderrin@...cinc.com" <quic_pderrin@...cinc.com>, "linux-mm@...ck.org"
	<linux-mm@...ck.org>, "will@...nel.org" <will@...nel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "hch@...radead.org"
	<hch@...radead.org>
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd

On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote:
> On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
> <rick.p.edgecombe@...el.com> wrote:
> > 
> > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > > Hello,
> > > 
> > > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > upstream calls to provide 1G page support for guest_memfd by taking
> > > pages from HugeTLB.
> > 
> > Do you have any more concrete numbers on the benefits of 1GB huge pages for
> > guest_memfd/coco VMs? The LPC talk lists the benefits as:
> > - Increase TLB hit rate and reduce page walks on TLB miss
> > - Improved IO performance
> > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> > backing memory
> > 
> > Do you know how often the 1GB TDP mappings get shattered by shared pages?
> > 
> > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> > memory savings (for example, dynamic PAMT), and the rest of the benefits don't
> > have numbers. How much are we getting for all the complexity, over, say,
> > buddy-allocated 2MB pages?
> 
> This series should work for any page size backed by hugetlb memory.
> Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> essential for certain workloads and will emerge as guest_memfd users.
> Features like KHO/memory persistence also depend on hugepage support
> in guest_memfd.
> 
> This series takes strides towards making guest_memfd compatible with
> use cases where 1G pages are essential; non-confidential VMs are
> already exercising them.
> 
> I think the main complexity here lies in supporting in-place
> conversion, which applies to any huge page size, even buddy-allocated
> 2MB pages or THP.
> 
> This complexity arises because page structs work at a fixed
> granularity; the future roadmap towards not having page structs for
> guest memory (at least private memory, to begin with) should greatly
> reduce this complexity.
> 
> That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> essential and complement this series well for better memory footprint
> and overall performance of TDX VMs.

Hmm, this didn't really answer my questions about the concrete benefits.

I think it would help to include this kind of justification for the 1GB
guest_memfd pages. "essential for certain workloads and will emerge" is a bit
hard to review against...
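
As a sanity check on the one concrete number: the ~1.6% HVO saving is just
struct page overhead arithmetic (rough sketch on my part, assuming a 64-byte
struct page and 4K base pages on x86_64):

	64 B of struct page per 4 KB page -> 64/4096 = 1.5625% of all RAM
	1GB hugepage: 262144 struct pages * 64 B = 16 MB of vmemmap
	HVO frees all but a few of those vmemmap pages, so nearly the
	full 1.5625% comes back

That bullet is easy to verify; it's the TLB and IO bullets that need numbers.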

I think one of the challenges with coco is that it's almost like a sprint to
reimplement virtualization. But enough things are changing at once that not all
of the normal assumptions hold, so it can't copy all the same solutions. A
recent example: for TDX huge pages, we found that the normal promotion paths
weren't actually yielding any benefit, for surprising TDX-specific reasons.

On the TDX side we are also, at least currently, unmapping private pages while
they are mapped shared, so any 1GB pages would get split to 2MB if there are any
shared pages in them. I wonder how many 1GB pages would be left after all the
shared pages are converted. At smaller TD sizes, it may not be many.
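
To put a rough bound on that, here is a toy model (my own sketch, not from the
series): assume shared 4K pages land uniformly at random across guest memory,
which is pessimistic since real guests convert contiguous ranges like DMA
buffers. A 1GB mapping survives only if none of its 262144 base pages is
shared:

/* gcc frag.c -lm; toy estimate of surviving 1GB TDP mappings */
#include <stdio.h>
#include <math.h>

int main(void)
{
	const double pages_per_1g = 262144.0;	/* 1GB / 4KB */
	const double ratios[] = { 1e-7, 1e-6, 1e-5, 1e-4 };

	for (int i = 0; i < 4; i++) {
		/* P(no shared page in a 1GB region) = (1 - s)^N */
		double intact = pow(1.0 - ratios[i], pages_per_1g);
		printf("shared ratio %.0e -> %4.1f%% of 1GB mappings intact\n",
		       ratios[i], 100.0 * intact);
	}
	return 0;
}

That gives ~97% intact at one shared page in 10 million, ~77% at one in a
million, and only ~7% at one in 100K. So unless conversions cluster into a few
contiguous ranges, the 1GB mappings mostly evaporate.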

So for TDX in isolation, it seems like jumping too far ahead to effectively
judge the value. But presumably you guys are testing this on SEV or something?
Have you measured any performance improvement? For what kind of applications?
Or is the idea basically to make guest_memfd work the way Google does guest
memory?
