linux-kernel - Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <b5ce9dfe7277fa976da5b762545ca213e649fcbc.camel@intel.com>
Date: Tue, 12 Aug 2025 23:34:59 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Annapurve, Vishal" <vannapurve@...gle.com>
CC: "Gao, Chao" <chao.gao@...el.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"Huang, Kai" <kai.huang@...el.com>, "kas@...nel.org" <kas@...nel.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "bp@...en8.de" <bp@...en8.de>,
	"mingo@...hat.com" <mingo@...hat.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
	"x86@...nel.org" <x86@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "dave.hansen@...ux.intel.com"
	<dave.hansen@...ux.intel.com>, "linux-coco@...ts.linux.dev"
	<linux-coco@...ts.linux.dev>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [PATCHv2 00/12] TDX: Enable Dynamic PAMT

On Tue, 2025-08-12 at 15:00 -0700, Vishal Annapurve wrote:
> IMO, tieing lifetime of guest_memfd folios with that of KVM ownership
> beyond the memslot lifetime is leaking more state into guest_memfd
> than needed. e.g. This will prevent usecases where guest_memfd needs
> to be reused while handling reboot of a confidential VM [1].

How does it prevent this? If you really want to re-use guest memory in a fast
way then I think you would want the DPAMT to remain in place actually. It sounds
like an argument to trigger the add/remove from guestmemfd actually.

But I really question with all the work to rebuild S-EPT, and if you propose
DPAMT too, how much work is really gained by not needing to reallocate hugetlbfs
pages. Do you see how it could be surprising? I'm currently assuming there is
some missing context.

> 
> IMO, if avoidable, its better to not have DPAMT or generally other KVM
> arch specific state tracking hooked up to guest memfd folios specially
> with hugepage support and whole folio splitting/merging that needs to
> be done. If you still need it, guest_memfd should be stateless as much
> as possible just like we are pushing for SNP preparation tracking [2]
> to happen within KVM SNP and IMO any such tracking should ideally be
> cleaned up on memslot unbinding.

I'm not sure gmemfd would need to track state. It could be a callback. But it
may be academic anyway. Below...

> 
> [1] https://lore.kernel.org/kvm/CAGtprH9NbCPSwZrQAUzFw=4rZPA60QBM2G8opYo9CZxRiYihzg@mail.gmail.com/
> [2] https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/

Looking into that more, from the code it seems it's not quite so
straightforward. Demote will always require new DPAMT pages to be passed, and
promote will always remove the 4k DPAMT entries and pass them back to the host.
But on early reading, 2MB PAGE.AUG looks like it can handle the DPAMT being
mapped at 4k. So maybe there is some wiggle room there? But before I dig
further, I think I've heard 4 possible arguments for keeping the existing
design:

1. TDX module may require or at least push the caller to have S-EPT match DPAMT
size (confirmation TBD)
2. Mapping DPAMT all at 4k requires extra memory for TD huge pages
3. It *may* slow TD boots because things can't be lazily installed via the fault
path. (testing not done)
4. While the global lock is bad, there is an easy fix for that if it is needed.

It seems Vishal cares a lot about (2). So I'm wondering if we need to keep going
down this path.

In the meantime, I'm going to try to get some better data on the global lock
contention (Sean's question about how much of the memory was actually faulted
in).