Message-ID: <aG1dbD2Xnpi_Cqf_@google.com>
Date: Tue, 8 Jul 2025 11:03:24 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: Vishal Annapurve <vannapurve@...gle.com>, "pvorel@...e.cz" <pvorel@...e.cz>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "catalin.marinas@....com" <catalin.marinas@....com>,
Jun Miao <jun.miao@...el.com>, "nsaenz@...zon.es" <nsaenz@...zon.es>,
Kirill Shutemov <kirill.shutemov@...el.com>, "pdurrant@...zon.co.uk" <pdurrant@...zon.co.uk>,
"peterx@...hat.com" <peterx@...hat.com>, "x86@...nel.org" <x86@...nel.org>,
"tabba@...gle.com" <tabba@...gle.com>, "amoorthy@...gle.com" <amoorthy@...gle.com>,
"quic_svaddagi@...cinc.com" <quic_svaddagi@...cinc.com>, "jack@...e.cz" <jack@...e.cz>,
"vkuznets@...hat.com" <vkuznets@...hat.com>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
"keirf@...gle.com" <keirf@...gle.com>,
"mail@...iej.szmigiero.name" <mail@...iej.szmigiero.name>,
"anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>, Wei W Wang <wei.w.wang@...el.com>,
"palmer@...belt.com" <palmer@...belt.com>,
"Wieczor-Retman, Maciej" <maciej.wieczor-retman@...el.com>, Yan Y Zhao <yan.y.zhao@...el.com>,
"ajones@...tanamicro.com" <ajones@...tanamicro.com>, "willy@...radead.org" <willy@...radead.org>,
"paul.walmsley@...ive.com" <paul.walmsley@...ive.com>, Dave Hansen <dave.hansen@...el.com>,
"aik@....com" <aik@....com>, "usama.arif@...edance.com" <usama.arif@...edance.com>,
"quic_mnalajal@...cinc.com" <quic_mnalajal@...cinc.com>, "fvdl@...gle.com" <fvdl@...gle.com>,
"rppt@...nel.org" <rppt@...nel.org>, "quic_cvanscha@...cinc.com" <quic_cvanscha@...cinc.com>,
"maz@...nel.org" <maz@...nel.org>, "vbabka@...e.cz" <vbabka@...e.cz>,
"anup@...infault.org" <anup@...infault.org>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "mic@...ikod.net" <mic@...ikod.net>,
"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>, Fan Du <fan.du@...el.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "steven.price@....com" <steven.price@....com>,
"muchun.song@...ux.dev" <muchun.song@...ux.dev>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, Zhiquan1 Li <zhiquan1.li@...el.com>,
"rientjes@...gle.com" <rientjes@...gle.com>, "mpe@...erman.id.au" <mpe@...erman.id.au>,
Erdem Aktas <erdemaktas@...gle.com>, "david@...hat.com" <david@...hat.com>, "jgg@...pe.ca" <jgg@...pe.ca>,
"hughd@...gle.com" <hughd@...gle.com>, "jhubbard@...dia.com" <jhubbard@...dia.com>, Haibo1 Xu <haibo1.xu@...el.com>,
Isaku Yamahata <isaku.yamahata@...el.com>, "jthoughton@...gle.com" <jthoughton@...gle.com>,
"steven.sistare@...cle.com" <steven.sistare@...cle.com>,
"quic_pheragu@...cinc.com" <quic_pheragu@...cinc.com>, "jarkko@...nel.org" <jarkko@...nel.org>,
"chenhuacai@...nel.org" <chenhuacai@...nel.org>, Kai Huang <kai.huang@...el.com>,
"shuah@...nel.org" <shuah@...nel.org>, "bfoster@...hat.com" <bfoster@...hat.com>,
"dwmw@...zon.co.uk" <dwmw@...zon.co.uk>, Chao P Peng <chao.p.peng@...el.com>,
"pankaj.gupta@....com" <pankaj.gupta@....com>, Alexander Graf <graf@...zon.com>,
"nikunj@....com" <nikunj@....com>, "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "yuzenghui@...wei.com" <yuzenghui@...wei.com>,
"jroedel@...e.de" <jroedel@...e.de>, "suzuki.poulose@....com" <suzuki.poulose@....com>,
"jgowans@...zon.com" <jgowans@...zon.com>, Yilun Xu <yilun.xu@...el.com>,
"liam.merwick@...cle.com" <liam.merwick@...cle.com>, "michael.roth@....com" <michael.roth@....com>,
"quic_tsoni@...cinc.com" <quic_tsoni@...cinc.com>, Xiaoyao Li <xiaoyao.li@...el.com>,
"aou@...s.berkeley.edu" <aou@...s.berkeley.edu>, Ira Weiny <ira.weiny@...el.com>,
"richard.weiyang@...il.com" <richard.weiyang@...il.com>,
"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>, "qperret@...gle.com" <qperret@...gle.com>,
"dmatlack@...gle.com" <dmatlack@...gle.com>, "james.morse@....com" <james.morse@....com>,
"brauner@...nel.org" <brauner@...nel.org>, "roypat@...zon.co.uk" <roypat@...zon.co.uk>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"quic_pderrin@...cinc.com" <quic_pderrin@...cinc.com>, "hch@...radead.org" <hch@...radead.org>,
"will@...nel.org" <will@...nel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > Right, I read that. I still don't see why pKVM needs to do normal
> > > private/shared
> > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > special case.
> >
> > It's dictated by pKVM usecases, memory contents need to be preserved
> > for every conversion not just for initial payload population.
>
> We are weighing pros/cons between:
> - Unifying this uABI across all gmemfd VM types
> - Userspace for one VM type passing a flag for its special non-shared use case
>
> I don't see how passing a flag or not is dictated by pKVM use case.

Yep.  Baking the behavior of a single usecase into the kernel's ABI is rarely a
good idea.  Just because pKVM's current usecases always want contents to be
preserved doesn't mean that pKVM will never change.

As a general rule, KVM should push policy to userspace whenever possible.

> P.S. This doesn't really impact TDX I think. Except that TDX development needs
> to work in the code without bumping anything. So just wishing to work in code
> with less conditionals.
>
> >
> > >
> > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > behave the same. If conversions are always content preserving for pKVM, why
> > > can't userspace always use the operation that says preserve content? Vs
> > > changing the behavior of the common operations?
> >
> > I don't see a benefit of userspace passing a flag that's kind of
> > default for the VM type (assuming pKVM will use a special VM type).
>
> The benefit is that we don't need to have special VM default behavior for
> gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> mode for TDX that adds new private data to a running guest (with special accept
> on the guest side or something). Then we might want to add a flag to override
> the default destructive behavior. Then maybe pKVM wants to add a "don't
> preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> lots of VM specific flags. The point of this example is to show how unified uABI
> can be helpful.

Yep again.  Pivoting on the VM type would be completely inflexible.  If pKVM gains
a usecase that wants to zero memory on conversions, we're hosed.  If SNP or TDX
gains the ability to preserve data on conversions, we're hosed.

The VM type may restrict what is possible, but (a) that should be abstracted,
e.g. by defining the allowed flags during guest_memfd creation, and (b) the
capabilities of the guest_memfd instance need to be communicated to userspace.

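
To illustrate (a) and (b), here is a rough, standalone userspace model.  It is
only a sketch: every name in it (the GMEM_F_* flags, the vm_type enum, and the
gmem_*() helpers) is made up for the example and is not existing or proposed
KVM uAPI, and the per-VM-type masks are placeholder values.  The VM type is
consulted in exactly one place, to compute the allowed flags at creation; the
conversion path looks only at the flags userspace chose.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-instance conversion policies; NOT real KVM uAPI. */
#define GMEM_F_CONV_PRESERVE	(1ULL << 0)	/* keep contents on conversion */
#define GMEM_F_CONV_ZERO	(1ULL << 1)	/* zero contents on conversion */

/* Hypothetical VM types, consulted only when the instance is created. */
enum vm_type { VM_PKVM, VM_SNP, VM_TDX };

struct gmem {
	uint64_t flags;		/* policy selected by userspace */
	uint64_t allowed;	/* what the owning VM permits */
	size_t size;
	unsigned char *mem;
};

/* (a) The only VM-aware code: map a VM type to its allowed flags.
 * The masks are illustrative assumptions, not statements about hardware. */
static uint64_t vm_allowed_flags(enum vm_type type)
{
	switch (type) {
	case VM_PKVM:
		return GMEM_F_CONV_PRESERVE | GMEM_F_CONV_ZERO;
	case VM_SNP:
	case VM_TDX:
		return GMEM_F_CONV_ZERO;
	}
	return 0;
}

/* Create an instance; reject flags the owning VM does not allow. */
static int gmem_create(enum vm_type type, size_t size, uint64_t flags,
		       struct gmem *g)
{
	uint64_t allowed = vm_allowed_flags(type);

	if (flags & ~allowed)
		return -EINVAL;
	/* Userspace must pick exactly one conversion policy. */
	if (flags != GMEM_F_CONV_PRESERVE && flags != GMEM_F_CONV_ZERO)
		return -EINVAL;

	g->mem = calloc(1, size);
	if (!g->mem)
		return -ENOMEM;
	g->flags = flags;
	g->allowed = allowed;
	g->size = size;
	return 0;
}

/* (b) Report the instance's capabilities back to userspace. */
static uint64_t gmem_caps(const struct gmem *g)
{
	return g->allowed;
}

/* Conversion pivots on the instance's flags, never on the VM type. */
static void gmem_convert(struct gmem *g)
{
	if (g->flags & GMEM_F_CONV_ZERO)
		memset(g->mem, 0, g->size);
	/* GMEM_F_CONV_PRESERVE: leave the contents untouched. */
}

static void gmem_destroy(struct gmem *g)
{
	free(g->mem);
	g->mem = NULL;
}
```

In this model, if a VM type later gains or loses a capability, only
vm_allowed_flags() changes; gmem_convert() and userspace's way of selecting a
policy stay the same.
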
> > Common operations in guest_memfd will need to either check for the
> > userspace passed flag or the VM type, so no major change in
> > guest_memfd implementation for either mechanism.
>
> While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> fd tied to a VM?
Yes.
> I think there is interest in de-coupling it?
No? Even if we get to a point where multiple distinct VMs can bind to a single
guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
non-trivial complexity for zero practical benefit.
> Is the VM type sticky?
>
> It seems the more they are separate, the better it will be to not have VM-aware
> behavior living in gmem.

Ya.  A guest_memfd instance may have capabilities/features that are restricted
and/or defined based on the properties of the owning VM, but we should do our
best to make guest_memfd itself blissfully unaware of the VM type.