Message-ID: <CAGtprH-rUuk=9shX9bsP4K=UPVvG1cUJCiXBfW07mZ1cjtkcQw@mail.gmail.com>
Date: Mon, 23 Jun 2025 13:22:35 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "Gao, Chao" <chao.gao@...el.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
"Huang, Kai" <kai.huang@...el.com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Chatre, Reinette" <reinette.chatre@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Hunter, Adrian" <adrian.hunter@...el.com>,
"Li, Xiaoyao" <xiaoyao.li@...el.com>,
"tony.lindgren@...ux.intel.com" <tony.lindgren@...ux.intel.com>,
"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [PATCH V4 1/1] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
On Mon, Jun 23, 2025 at 9:23 AM Edgecombe, Rick P
<rick.p.edgecombe@...el.com> wrote:
>
> On Fri, 2025-06-20 at 20:00 -0700, Vishal Annapurve wrote:
> > > Can you provide enough information to evaluate how the whole problem is being
> > > solved? (it sounds like you have the full solution implemented?)
> > >
> > > The problem seems to be that rebuilding a whole TD for reboot is too slow. Does
> > > the S-EPT survive if the VM is destroyed? If not, how does keeping the pages in
> > > guestmemfd help with re-faulting? If the S-EPT is preserved, then what happens
> > > when the new guest re-accepts it?
> >
> > SEPT entries don't survive reboots.
> >
> > The faulting-in I was referring to is just allocation of memory pages
> > for guest_memfd offsets.
> >
> > >
> > > >
> > > > >
> > > > > The series Vishal linked has some kind of SEV state transfer thing. How is
> > > > > it intended to work for TDX?
> > > >
> > > > The series[1] unblocks intrahost-migration [2] and reboot usecases.
> > > >
> > > > [1] https://lore.kernel.org/lkml/cover.1747368092.git.afranji@google.com/#t
> > > > [2] https://lore.kernel.org/lkml/cover.1749672978.git.afranji@google.com/#t
> > >
> > > The question was: how was this reboot optimization intended to work for TDX? Are
> > > you saying that it works via intra-host migration? Like some state is migrated
> > > to the new TD to start it up?
> >
> > Reboot optimization is not specific to TDX, it's basically just about
> > trying to reuse the same physical memory for the next boot. No state
> > is preserved here except the mapping of guest_memfd offsets to
> > physical memory pages.
>
> Hmm, it doesn't sound like much work, especially at the 1GB level. I wonder if
> it has something to do with the cost of zeroing the pages. If they went to a
> global allocator and back, they would need to be zeroed to make sure data is not
> leaked to another userspace process. But if it stays with the fd, this could be
> skipped?
A simple question I ask myself is: if a certain memory-specific
optimization/feature is enabled for non-confidential VMs, why can't
it be enabled for Confidential VMs? I think as long as we cleanly
separate memory management from RMP/SEPT management for CVMs, there
should ideally be no major issues with enabling such optimizations
for Confidential VMs.
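
To make that concrete, the userspace flow I'm picturing is roughly the
below. Illustrative sketch only: KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2 are existing uAPI, but keeping the same fd
alive across VM instances is exactly what the linked series has to
enable, so don't read any real uAPI into that step (names are made up,
error handling omitted).

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: keep guest memory with the guest_memfd across a reboot and
 * only rebuild the VM/S-EPT around it.
 */
static int boot_vm_instance(int kvm_fd, int *gmem_fd, __u64 size)
{
        int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);

        if (*gmem_fd < 0) {
                /* First boot: allocate the private memory pool once. */
                struct kvm_create_guest_memfd gmem = { .size = size };

                *gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
        }

        /*
         * On reboot the old VM (and its S-EPT) is gone, but *gmem_fd was
         * kept open so the already-allocated pages stayed with the fd.
         * Bind the same fd/offsets into the new instance's memslot.
         * With today's upstream the fd is tied to the creating VM, so
         * making this step work is what the linked series is about.
         */
        struct kvm_userspace_memory_region2 region = {
                .slot = 0,
                .flags = KVM_MEM_GUEST_MEMFD,
                .guest_phys_addr = 0,
                .memory_size = size,
                .guest_memfd = (__u32)*gmem_fd,
                .guest_memfd_offset = 0,
        };

        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
        return vm_fd;
}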
Just memory allocation without zeroing, even with hugepages, takes
time for large VM shapes, and I don't really see a valid reason for
the userspace VMM to repeat the freeing and allocation cycles across
reboots.
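
Even a trivial hugetlb microbenchmark, which has nothing to do with
guest_memfd or any zeroing tricks, shows the bulk allocation cost
alone. Something like the below (size is arbitrary, hugepages need to
be reserved via /proc/sys/vm/nr_hugepages, and the numbers obviously
vary by machine):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
        const size_t size = 64ULL << 30;  /* 64 GiB; real VM shapes are larger */
        int fd = memfd_create("alloc-test", MFD_HUGETLB);
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        /* Allocate all the backing hugepages up front. */
        if (fallocate(fd, 0, 0, size))
                perror("fallocate");
        clock_gettime(CLOCK_MONOTONIC, &b);

        printf("allocated %zu GiB in %.2f s\n", size >> 30,
               (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        return 0;
}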
> For TDX though, hmm, we may not actually need to zero the private pages because
> of the transition to keyid 0. It would be beneficial to have the different VMs
> types work the same. But, under this speculation of the real benefit, there may
> be other ways to get the same benefits that are worth considering when we hit
> frictions like this. To do that kind of consideration though, everyone needs to
> understand what the real goal is.
>
> In general I think we really need to fully evaluate these optimizations as part
> of the upstreaming process. We have already seen two post-base series TDX
> optimizations that didn't stand up under scrutiny. It turned out the existing
> TDX page promotion implementation wasn't actually getting used much if at all.
> Also, the parallel TD reclaim thing turned out to be misguided once we looked
For ~700G of guest memory, guest shutdown times are:
1) Parallel TD reclaim + hugepage EPT mappings: 30 secs
2) TD shutdown with KVM_TDX_TERMINATE_VM + hugepage EPT mappings: 2 mins
3) Without any optimization: ~30-40 mins

KVM_TDX_TERMINATE_VM is a very good start for now and is much simpler
to upstream.
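
For completeness, this is how I'd expect userspace to invoke the new
sub-ioctl, going by the uAPI added in this patch (untested sketch,
error handling omitted):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>  /* KVM_MEMORY_ENCRYPT_OP, struct kvm_tdx_cmd */

static int tdx_terminate_vm(int vm_fd)
{
        struct kvm_tdx_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.id = KVM_TDX_TERMINATE_VM;

        /*
         * Per the patch description, this releases the TD's HKID early
         * so the remaining teardown/page reclaim can take the fast path.
         */
        return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}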
> into the root cause. So if we blindly incorporate optimizations based on vague
> or promised justification, it seems likely we will end up maintaining some
> amount of complex code with no purpose. Then it will be difficult to prove later
> that it is not needed, and just remain a burden.
>
> So can we please start explaining more of the "why" for this stuff so we can get
> to the best upstream solution?