Message-ID: <ce15353884bd67cc2608d36ef40a178a8d140333.camel@intel.com>
Date: Fri, 16 May 2025 16:45:58 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Annapurve, Vishal" <vannapurve@...gle.com>
CC: "pvorel@...e.cz" <pvorel@...e.cz>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "catalin.marinas@....com" <catalin.marinas@....com>,
"Miao, Jun" <jun.miao@...el.com>, "palmer@...belt.com" <palmer@...belt.com>,
"pdurrant@...zon.co.uk" <pdurrant@...zon.co.uk>, "vbabka@...e.cz"
<vbabka@...e.cz>, "peterx@...hat.com" <peterx@...hat.com>, "x86@...nel.org"
<x86@...nel.org>, "amoorthy@...gle.com" <amoorthy@...gle.com>, "jack@...e.cz"
<jack@...e.cz>, "maz@...nel.org" <maz@...nel.org>, "tabba@...gle.com"
<tabba@...gle.com>, "vkuznets@...hat.com" <vkuznets@...hat.com>,
"quic_svaddagi@...cinc.com" <quic_svaddagi@...cinc.com>,
"mail@...iej.szmigiero.name" <mail@...iej.szmigiero.name>, "hughd@...gle.com"
<hughd@...gle.com>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
"Wang, Wei W" <wei.w.wang@...el.com>, "keirf@...gle.com" <keirf@...gle.com>,
"Wieczor-Retman, Maciej" <maciej.wieczor-retman@...el.com>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"ajones@...tanamicro.com" <ajones@...tanamicro.com>, "rppt@...nel.org"
<rppt@...nel.org>, "quic_mnalajal@...cinc.com" <quic_mnalajal@...cinc.com>,
"aik@....com" <aik@....com>, "usama.arif@...edance.com"
<usama.arif@...edance.com>, "fvdl@...gle.com" <fvdl@...gle.com>,
"paul.walmsley@...ive.com" <paul.walmsley@...ive.com>,
"quic_cvanscha@...cinc.com" <quic_cvanscha@...cinc.com>, "nsaenz@...zon.es"
<nsaenz@...zon.es>, "willy@...radead.org" <willy@...radead.org>, "Du, Fan"
<fan.du@...el.com>, "anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "mic@...ikod.net"
<mic@...ikod.net>, "oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
"Shutemov, Kirill" <kirill.shutemov@...el.com>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "steven.price@....com" <steven.price@....com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"muchun.song@...ux.dev" <muchun.song@...ux.dev>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "rientjes@...gle.com" <rientjes@...gle.com>,
"mpe@...erman.id.au" <mpe@...erman.id.au>, "Aktas, Erdem"
<erdemaktas@...gle.com>, "david@...hat.com" <david@...hat.com>,
"jgg@...pe.ca" <jgg@...pe.ca>, "bfoster@...hat.com" <bfoster@...hat.com>,
"jhubbard@...dia.com" <jhubbard@...dia.com>, "Xu, Haibo1"
<haibo1.xu@...el.com>, "anup@...infault.org" <anup@...infault.org>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "jthoughton@...gle.com"
<jthoughton@...gle.com>, "will@...nel.org" <will@...nel.org>,
"steven.sistare@...cle.com" <steven.sistare@...cle.com>,
"quic_pheragu@...cinc.com" <quic_pheragu@...cinc.com>, "jarkko@...nel.org"
<jarkko@...nel.org>, "chenhuacai@...nel.org" <chenhuacai@...nel.org>, "Huang,
Kai" <kai.huang@...el.com>, "shuah@...nel.org" <shuah@...nel.org>,
"dwmw@...zon.co.uk" <dwmw@...zon.co.uk>, "pankaj.gupta@....com"
<pankaj.gupta@....com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"nikunj@....com" <nikunj@....com>, "Graf, Alexander" <graf@...zon.com>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "yuzenghui@...wei.com" <yuzenghui@...wei.com>,
"jroedel@...e.de" <jroedel@...e.de>, "suzuki.poulose@....com"
<suzuki.poulose@....com>, "jgowans@...zon.com" <jgowans@...zon.com>, "Xu,
Yilun" <yilun.xu@...el.com>, "liam.merwick@...cle.com"
<liam.merwick@...cle.com>, "michael.roth@....com" <michael.roth@....com>,
"quic_tsoni@...cinc.com" <quic_tsoni@...cinc.com>,
"richard.weiyang@...il.com" <richard.weiyang@...il.com>, "Weiny, Ira"
<ira.weiny@...el.com>, "aou@...s.berkeley.edu" <aou@...s.berkeley.edu>, "Li,
Xiaoyao" <xiaoyao.li@...el.com>, "qperret@...gle.com" <qperret@...gle.com>,
"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>,
"dmatlack@...gle.com" <dmatlack@...gle.com>, "james.morse@....com"
<james.morse@....com>, "brauner@...nel.org" <brauner@...nel.org>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"pgonda@...gle.com" <pgonda@...gle.com>, "quic_pderrin@...cinc.com"
<quic_pderrin@...cinc.com>, "roypat@...zon.co.uk" <roypat@...zon.co.uk>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "seanjc@...gle.com"
<seanjc@...gle.com>, "hch@...radead.org" <hch@...radead.org>
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote:
> The crux of this series really is hugetlb backing support for
> guest_memfd and handling CoCo VMs irrespective of the page size as I
> suggested earlier, so 2M page sizes will need to handle similar
> complexity of in-place conversion.
I assumed this part was the added 1GB complexity:
mm/hugetlb.c | 488 ++---
I'll dig into the series and try to understand the point better.
>
> Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
> lower memory footprint using HVO and lower MMU/IOMMU page table memory
> footprint among other improvements. These percentages carry a
> substantial impact when working at the scale of large fleets of hosts
> each carrying significant memory capacity.
There must have been a lot of measuring involved in that. But the numbers I was
hoping for were how much *this* series helps upstream.
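(Some of those wins have well-known math behind them, to be fair: with 64-byte
struct pages, vmemmap alone is ~1.6% of memory, and IIRC HVO leaves 1 vmemmap
page per 1GB hugetlb page instead of 4096. It's the end-to-end numbers for this
series in particular that I was fishing for.)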
>
> guest_memfd hugepage support + hugepage EPT mapping support for TDX
> VMs significantly help:
> 1) ~70% decrease in TDX VM boot up time
> 2) ~65% decrease in TDX VM shutdown time
> 3) ~90% decrease in TDX VM PAMT memory overhead
> 4) Improvement in TDX SEPT memory overhead
Thanks. Is this the difference between 4K mappings and 2MB mappings, I guess? Or
are you saying it is the difference between 1GB contiguous pages for TDX mapped
at 2MB, and 2MB contiguous pages mapped at 2MB? The 1GB part is the one I was
curious about.
>
> And we believe this combination should also help achieve better
> performance with TDX connect in future.
Please don't take this query as an objection that the series doesn't help TDX
enough or something like that. Even if it didn't help TDX at all (not the case
here), that would be fine. The concern is only that the specific benefits and
tradeoffs around 1GB pages are not clear in the upstream posting.
>
> Hugetlb huge pages are preferred as they are statically carved out at
> boot and so provide much better guarantees of availability.
>
That reserved memory can provide physically contiguous pages more reliably is
not surprising at all, but it is also something that could have a number
attached.
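(For reference, the static carve-out in question is the standard boot-time
reservation, e.g. booting with hugepagesz=1G hugepages=N, or writing N to
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages before memory has a
chance to fragment.)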
> Once the
> pages are carved out, any VMs scheduled on such a host will need to
> work with the same hugetlb memory sizes. This series attempts to use
> hugetlb pages with in-place conversion, avoiding the double allocation
> problem that otherwise results in significant memory overheads for
> CoCo VMs.
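To check I have the shape of it right, the userspace flow would be something
like the below (untested sketch; the hugetlb flag name is a placeholder for
whatever this series settles on, the rest is existing uAPI):

        /* Back a 1G range of guest memory with a single hugetlb folio. */
        struct kvm_create_guest_memfd gmem = {
                .size  = 1UL << 30,
                .flags = GUEST_MEMFD_FLAG_HUGETLB,  /* hypothetical name */
        };
        int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

        struct kvm_userspace_memory_region2 region = {
                .slot               = 0,
                .flags              = KVM_MEM_GUEST_MEMFD,
                .guest_phys_addr    = 0,
                .memory_size        = 1UL << 30,
                .guest_memfd        = gmem_fd,
                .guest_memfd_offset = 0,
        };
        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

        /* Shared->private conversion flips attributes on the same
         * backing pages, rather than allocating a second copy. */
        struct kvm_memory_attributes attr = {
                .address    = 0,
                .size       = 1UL << 30,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };
        ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);

If that's right, I can see the argument that the in-place conversion
complexity is similar regardless of page size, and the 1GB part is "just" the
backing allocation.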
I asked this question assuming there were some measurements for the 1GB part of
this series. It sounds like the reasoning is instead that this is how Google
does things, which is backed by way more benchmarking than kernel patches are
used to getting. So it can just be reasonably assumed to be helpful.
But for upstream code, I'd expect something a bit more concrete than "we
believe" and "substantial impact". It seems like I'm in the minority here,
though. So if no one else wants to pressure test the thinking in the usual way,
I guess I'll just have to wonder.