Message-ID: <CAGtprH9NbCPSwZrQAUzFw=4rZPA60QBM2G8opYo9CZxRiYihzg@mail.gmail.com>
Date: Fri, 11 Jul 2025 14:18:03 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "pvorel@...e.cz" <pvorel@...e.cz>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "catalin.marinas@....com" <catalin.marinas@....com>, 
	Jun Miao <jun.miao@...el.com>, "palmer@...belt.com" <palmer@...belt.com>, 
	"pdurrant@...zon.co.uk" <pdurrant@...zon.co.uk>, "vbabka@...e.cz" <vbabka@...e.cz>, 
	"peterx@...hat.com" <peterx@...hat.com>, "x86@...nel.org" <x86@...nel.org>, 
	"amoorthy@...gle.com" <amoorthy@...gle.com>, "tabba@...gle.com" <tabba@...gle.com>, 
	"quic_svaddagi@...cinc.com" <quic_svaddagi@...cinc.com>, "maz@...nel.org" <maz@...nel.org>, 
	"vkuznets@...hat.com" <vkuznets@...hat.com>, 
	"anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>, 
	"mail@...iej.szmigiero.name" <mail@...iej.szmigiero.name>, 
	"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, Wei W Wang <wei.w.wang@...el.com>, 
	Fan Du <fan.du@...el.com>, 
	"Wieczor-Retman, Maciej" <maciej.wieczor-retman@...el.com>, Yan Y Zhao <yan.y.zhao@...el.com>, 
	"ajones@...tanamicro.com" <ajones@...tanamicro.com>, Dave Hansen <dave.hansen@...el.com>, 
	"paul.walmsley@...ive.com" <paul.walmsley@...ive.com>, 
	"quic_mnalajal@...cinc.com" <quic_mnalajal@...cinc.com>, "aik@....com" <aik@....com>, 
	"usama.arif@...edance.com" <usama.arif@...edance.com>, "fvdl@...gle.com" <fvdl@...gle.com>, 
	"jack@...e.cz" <jack@...e.cz>, "quic_cvanscha@...cinc.com" <quic_cvanscha@...cinc.com>, 
	Kirill Shutemov <kirill.shutemov@...el.com>, "willy@...radead.org" <willy@...radead.org>, 
	"steven.price@....com" <steven.price@....com>, "anup@...infault.org" <anup@...infault.org>, 
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "keirf@...gle.com" <keirf@...gle.com>, 
	"mic@...ikod.net" <mic@...ikod.net>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "nsaenz@...zon.es" <nsaenz@...zon.es>, 
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, 
	"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "muchun.song@...ux.dev" <muchun.song@...ux.dev>, 
	Zhiquan1 Li <zhiquan1.li@...el.com>, "rientjes@...gle.com" <rientjes@...gle.com>, 
	Erdem Aktas <erdemaktas@...gle.com>, "mpe@...erman.id.au" <mpe@...erman.id.au>, 
	"david@...hat.com" <david@...hat.com>, "jgg@...pe.ca" <jgg@...pe.ca>, "hughd@...gle.com" <hughd@...gle.com>, 
	"jhubbard@...dia.com" <jhubbard@...dia.com>, Haibo1 Xu <haibo1.xu@...el.com>, 
	Isaku Yamahata <isaku.yamahata@...el.com>, "jthoughton@...gle.com" <jthoughton@...gle.com>, 
	"rppt@...nel.org" <rppt@...nel.org>, "steven.sistare@...cle.com" <steven.sistare@...cle.com>, 
	"jarkko@...nel.org" <jarkko@...nel.org>, "quic_pheragu@...cinc.com" <quic_pheragu@...cinc.com>, 
	"chenhuacai@...nel.org" <chenhuacai@...nel.org>, Kai Huang <kai.huang@...el.com>, 
	"shuah@...nel.org" <shuah@...nel.org>, "bfoster@...hat.com" <bfoster@...hat.com>, 
	"dwmw@...zon.co.uk" <dwmw@...zon.co.uk>, Chao P Peng <chao.p.peng@...el.com>, 
	"pankaj.gupta@....com" <pankaj.gupta@....com>, Alexander Graf <graf@...zon.com>, 
	"nikunj@....com" <nikunj@....com>, "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "yuzenghui@...wei.com" <yuzenghui@...wei.com>, 
	"jroedel@...e.de" <jroedel@...e.de>, "suzuki.poulose@....com" <suzuki.poulose@....com>, 
	"jgowans@...zon.com" <jgowans@...zon.com>, Yilun Xu <yilun.xu@...el.com>, 
	"liam.merwick@...cle.com" <liam.merwick@...cle.com>, "michael.roth@....com" <michael.roth@....com>, 
	"quic_tsoni@...cinc.com" <quic_tsoni@...cinc.com>, Xiaoyao Li <xiaoyao.li@...el.com>, 
	"aou@...s.berkeley.edu" <aou@...s.berkeley.edu>, Ira Weiny <ira.weiny@...el.com>, 
	"richard.weiyang@...il.com" <richard.weiyang@...il.com>, 
	"kent.overstreet@...ux.dev" <kent.overstreet@...ux.dev>, "qperret@...gle.com" <qperret@...gle.com>, 
	"dmatlack@...gle.com" <dmatlack@...gle.com>, "james.morse@....com" <james.morse@....com>, 
	"brauner@...nel.org" <brauner@...nel.org>, 
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>, 
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "pgonda@...gle.com" <pgonda@...gle.com>, 
	"quic_pderrin@...cinc.com" <quic_pderrin@...cinc.com>, "roypat@...zon.co.uk" <roypat@...zon.co.uk>, 
	"hch@...radead.org" <hch@...radead.org>, "will@...nel.org" <will@...nel.org>, 
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd

On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@...gle.com> wrote:
> > > 3) KVM should ideally associate the lifetime of backing
> > > pagetables/protection tables/RMP tables with the lifetime of the
> > > binding of memslots with guest_memfd.
> >
> > Again, please align your indentation.
> >
> > >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > >            long the folios are mapped in guest_memfd, which I think should be
> > >            revisited.
> >
> > Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> > inodes are tied to the Virtual Machine, not to the "struct kvm" instance.
>
> IIUC guest_memfd can only be accessed through the window of memslots
> and if there are no memslots I don't see the reason for memory still
> being associated with "virtual machine". Likely because I am yet to
> completely wrap my head around 'guest_memfd inodes are tied to the
> Virtual Machine, not to the "struct kvm" instance', I need to spend
> more time on this one.
>

I see the benefits of tying inodes to the virtual machine and
different guest_memfd files to different KVM instances. This allows us
to exercise intra-host migration use cases for TDX/SNP. But I think
this model doesn't allow us to reuse guest_memfd files for SNP VMs
across a reboot.

Reboot scenario, assuming the existing guest_memfd inode is reused for
the next instance:
1) Create a VM.
2) Create guest_memfd files, which pin the KVM instance.
3) Create memslots.
4) Start the VM.
5) For reboot/shutdown, execute VM-specific termination (e.g.
KVM_TDX_TERMINATE_VM).
6) If allowed, delete the memslots.
7) Create a new VM instance.
8) Link the existing guest_memfd files to the new VM, which creates
new files for the same inode.
9) Close the existing guest_memfd files and the existing VM.
10) Jump to step 3.
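
To make the file vs. inode split in steps 2, 8 and 9 concrete, here is a
toy userspace model. It is only a sketch: every toy_* name is made up
for illustration and none of it is kernel API. Memslots (steps 3 and 6)
and RMP/S-EPT ownership are deliberately not modeled, so it shows only
that the inode's folios survive the handover from the old VM to the new
one.

```c
/* Toy userspace model of the reboot flow; hypothetical names, not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct toy_vm    { int refs; bool terminated; };
struct toy_inode { int files; int folios; };	/* the memory lives here */
struct toy_file  { struct toy_inode *inode; struct toy_vm *vm; bool open; };

/* Steps 2 and 8: each file on the inode pins exactly one VM instance. */
static struct toy_file toy_open(struct toy_inode *inode, struct toy_vm *vm)
{
	inode->files++;
	vm->refs++;
	return (struct toy_file){ .inode = inode, .vm = vm, .open = true };
}

/* Step 9: closing a file unpins its VM but leaves the inode's folios alone. */
static void toy_close(struct toy_file *f)
{
	assert(f->open);
	f->open = false;
	f->inode->files--;
	f->vm->refs--;
}

/* Runs steps 1-9 once and returns the folio count left on the inode. */
static int toy_reboot_once(int folios)
{
	struct toy_vm old_vm = { .refs = 1 };			/* step 1 */
	struct toy_inode inode = { .folios = folios };
	struct toy_file old_file = toy_open(&inode, &old_vm);	/* step 2 */

	old_vm.terminated = true;				/* step 5 */

	struct toy_vm new_vm = { .refs = 1 };			/* step 7 */
	struct toy_file new_file = toy_open(&inode, &new_vm);	/* step 8 */
	assert(inode.files == 2);	/* two files, one inode */

	toy_close(&old_file);					/* step 9 */
	old_vm.refs--;			/* drop the creator's own reference */
	assert(old_vm.refs == 0 && new_file.open && inode.files == 1);

	return inode.folios;
}
```

Calling toy_reboot_once(4) returns 4: the folios stay in place while the
pinned VM changes underneath them.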

The difference between SNP and TDX is that TDX memory ownership is
limited to the duration the pages are mapped in the second-stage
secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
memslots and effectively remains until the folios are punched out of
the guest_memfd filemap. IIUC CCA might follow suit with SNP in this
regard, with the pfns populated in GPT entries.
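
Put as a small predicate, the lifetime difference looks roughly like
this. Again a sketch with made-up toy_* names, not kernel code. It
assumes that deleting a memslot zaps the corresponding secure EPT
mappings for TDX, and it leaves CCA out because I'm not sure of its
behavior.

```c
/* Toy model: which event ends guest ownership of a page? Hypothetical names. */
#include <assert.h>
#include <stdbool.h>

enum toy_arch  { TOY_TDX, TOY_SNP };
enum toy_event { TOY_MEMSLOT_DELETED, TOY_FOLIO_PUNCHED };

/* Returns true if the event ends the guest's ownership of the page. */
static bool toy_releases_ownership(enum toy_arch arch, enum toy_event ev)
{
	switch (arch) {
	case TOY_TDX:
		/*
		 * Ownership lasts only while the page is mapped in the
		 * secure EPT. Assumption: both events remove that mapping.
		 */
		return ev == TOY_MEMSLOT_DELETED || ev == TOY_FOLIO_PUNCHED;
	case TOY_SNP:
		/* RMP ownership outlives memslots and ends at punch-out. */
		return ev == TOY_FOLIO_PUNCHED;
	}
	return false;
}
```

The only cell where the two differ is memslot deletion, which is the
case the reboot flow depends on.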

I don't have a sense of how critical this problem could be, but it
would mean that on every reboot all large memory allocations have to
be released and reallocated. For 1G support, we will be freeing
guest_memfd pages using a background thread, which may add some delay
before the memory is actually free.

Instead, we could do this:
1) Support creating guest_memfd files for a certain VM type, which
allows KVM to dictate the behavior of the guest_memfd.
2) Tie the lifetime of KVM SNP/TDX memory ownership to guest_memfd and
memslot bindings.
    - Each binding will increase a refcount on both the guest_memfd file
and KVM, so neither can go away while the binding exists.
3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
operations, while for TDX, KVM will invalidate secure EPT entries.
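
A rough userspace sketch of the refcounting in points 2 and 3, to show
the intended invariant. All toy_* names are hypothetical, and the single
boolean stands in for whatever RMP/GPT or secure EPT state would really
be invalidated at unbind.

```c
/* Toy model of memslot<->guest_memfd bindings; hypothetical names only. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_gmem { int refs; int folios; };
struct toy_kvm  { int refs; };

struct toy_binding {
	struct toy_gmem *gmem;
	struct toy_kvm *kvm;
	bool ownership_live;	/* guest ownership may exist only while bound */
};

/* Point 2: a binding pins both the guest_memfd file and the KVM instance. */
static void toy_bind(struct toy_binding *b, struct toy_gmem *g, struct toy_kvm *k)
{
	g->refs++;
	k->refs++;
	b->gmem = g;
	b->kvm = k;
	b->ownership_live = true;
}

/* Point 3: unbind invalidates ownership first, then drops both pins. */
static void toy_unbind(struct toy_binding *b)
{
	b->ownership_live = false;	/* stands in for RMP/GPT/S-EPT invalidation */
	b->gmem->refs--;
	b->kvm->refs--;
	b->gmem = NULL;
	b->kvm = NULL;
}

/* One bind/unbind cycle; returns true if the invariants held throughout. */
static bool toy_binding_cycle(void)
{
	struct toy_gmem g = { .refs = 1, .folios = 8 };
	struct toy_kvm k = { .refs = 1 };
	struct toy_binding b = { 0 };

	toy_bind(&b, &g, &k);
	bool pinned = g.refs == 2 && k.refs == 2 && b.ownership_live;

	toy_unbind(&b);
	bool released = g.refs == 1 && k.refs == 1 && !b.ownership_live;

	/* The folios are untouched: memory is not freed by unbinding. */
	return pinned && released && g.folios == 8;
}
```

The point of the ordering in toy_unbind() is that ownership state is
gone before either object can be released, while the folios themselves
stay allocated.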

This would let us decouple the memory lifecycle from the VM lifecycle
and match the behavior of non-confidential VMs, where memory can
outlast VMs. This approach would, however, mean a change in the
intra-host migration implementation, as we would no longer need to
differentiate guest_memfd files from inodes.

That being said, I might be missing something here, and I don't have
any data to back the criticality of this use case for SNP and possibly
CCA VMs.
