linux-kernel - Re: [PATCH v3 12/16] x86/virt/tdx: Add helpers to allow for pre-allocating pages


Message-ID: <9018c8629c921fae9ee993cd83b5a189616f51b0.camel@intel.com>
Date: Sun, 28 Sep 2025 22:56:25 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-coco@...ts.linux.dev"
	<linux-coco@...ts.linux.dev>, "Zhao, Yan Y" <yan.y.zhao@...el.com>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "mingo@...hat.com"
	<mingo@...hat.com>, "kas@...nel.org" <kas@...nel.org>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "Gao, Chao" <chao.gao@...el.com>, "Edgecombe, Rick
 P" <rick.p.edgecombe@...el.com>, "bp@...en8.de" <bp@...en8.de>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v3 12/16] x86/virt/tdx: Add helpers to allow for
 pre-allocating pages

On Fri, 2025-09-26 at 23:47 +0000, Edgecombe, Rick P wrote:
> On Mon, 2025-09-22 at 11:20 +0000, Huang, Kai wrote:
> > Since 'struct tdx_prealloc' replaces the KVM standard 'struct
> > kvm_mmu_memory_cache' for the external page table, and it is allowed to
> > fail in the "topup" operation, why not just call tdx_alloc_page() to
> > "topup" pages for the external page table here?
> 
> I sympathize with the intuition. It would be nice to just prep
> everything and then operate on it like normal.
> 
> We want this to not have to be totally redone for huge pages. In the
> huge pages case, we could do this approach for the page tables, but for
> the private page itself, we don't know whether we need 4KB PAMT backing
> or not. So we don't fully know whether a TDX private page needs PAMT
> 4KB backing or not before the fault.
> 
> So we would need, like, separate pools for page tables and private
> pages. 
> 
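To make the huge-page concern concrete, here is a minimal sketch. It assumes a fault path that learns the final mapping level only at fault time, and it tracks both pools as plain counters. `DPAMT_PAGES_PER_2M`, `struct fault_pools`, `pools_topup()` and `pools_consume()` are hypothetical names, not part of the patch series. The sketch shows why the DPAMT pool has to be topped up for the worst case (a 4K mapping) even though it goes unused when the page ends up mapped at 2M.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical: pages needed to add 4K PAMT backing for one 2M range.
 * The real count would come from the TDX module. */
#define DPAMT_PAGES_PER_2M 2

enum map_level { MAP_4K, MAP_2M };

/* Hypothetical pair of pools, tracked as simple counts for illustration. */
struct fault_pools {
	int pt_pages;    /* S-EPT page table pages: always 4K, always consumed */
	int dpamt_pages; /* DPAMT pages for the private page: level-dependent */
};

/* Before the fault the final mapping level is unknown, so the DPAMT
 * pool must be topped up for the worst case (a 4K mapping). */
static void pools_topup(struct fault_pools *p, int pt_needed)
{
	if (p->pt_pages < pt_needed)
		p->pt_pages = pt_needed;
	if (p->dpamt_pages < DPAMT_PAGES_PER_2M)
		p->dpamt_pages = DPAMT_PAGES_PER_2M;
}

/* At fault time the level is known: DPAMT pages are used only for 4K.
 * Both pools are checked before either one is consumed. */
static int pools_consume(struct fault_pools *p, int pt_used,
			 enum map_level level)
{
	int dpamt_used = (level == MAP_4K) ? DPAMT_PAGES_PER_2M : 0;

	if (p->pt_pages < pt_used || p->dpamt_pages < dpamt_used)
		return -ENOMEM;
	p->pt_pages -= pt_used;
	p->dpamt_pages -= dpamt_used;
	return 0;
}
```

In the 2M case the DPAMT pages are still in the pool after the fault; that leftover (or an unwind of the wrong guess) is what has to be handled somewhere.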

The private page itself comes from guest_memfd via the page cache, so we
cannot use tdx_alloc_page() for it anyway; we need to call tdx_pamt_get()
explicitly.
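A minimal userspace sketch of the suggestion to top up page table pages with tdx_alloc_page(). It assumes `mock_tdx_alloc_page()` behaves like tdx_alloc_page(), i.e. it allocates a page and installs its Dynamic PAMT backing in one step. `struct sept_page_cache` and `sept_cache_topup()` are hypothetical. The private page is deliberately not part of this cache, since it would get its PAMT backing through tdx_pamt_get() instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SZ_4K 4096
#define CACHE_CAPACITY 8

/* Simulation state: pages whose (mock) PAMT backing is installed. */
static int pamt_backed_pages;
/* Test knob: number of further allocations allowed to succeed (-1 = no limit). */
static int alloc_budget = -1;

/* Hypothetical stand-in for tdx_alloc_page(): allocate a 4K page and
 * install its Dynamic PAMT backing in one step. Returns NULL on failure. */
static void *mock_tdx_alloc_page(void)
{
	void *page;

	if (alloc_budget == 0)
		return NULL;
	page = malloc(SZ_4K);
	if (!page)
		return NULL;
	if (alloc_budget > 0)
		alloc_budget--;
	pamt_backed_pages++;
	return page;
}

/* Hypothetical cache of ready-to-use S-EPT page table pages. */
struct sept_page_cache {
	int nobjs;
	void *objs[CACHE_CAPACITY];
};

/* Fill the cache up to @min pages. Every cached page already has PAMT
 * backing, so the fault path can consume pages without further setup. */
static int sept_cache_topup(struct sept_page_cache *mc, int min)
{
	assert(min <= CACHE_CAPACITY);
	while (mc->nobjs < min) {
		void *page = mock_tdx_alloc_page();

		if (!page)
			return -ENOMEM; /* topup is allowed to fail */
		mc->objs[mc->nobjs++] = page;
	}
	return 0;
}
```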

I don't know all the details of exactly how this interacts with huge pages
-- I would imagine we only call tdx_pamt_get() when we find that the
private page is mapped at 4K rather than 2M [*], but I don't see how this
conflicts with using tdx_alloc_page() for the page table itself.

The point is that a page table page is always 4K, so it is no different
from a control page.
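The following model illustrates why a 4K S-EPT page table page and a 4K control page can share one path. It assumes that 4K PAMT backing is installed per 2M range and reference-counted by the 4K pages in use within that range. All names (`model_pamt_get()`, `model_pamt_put()`, `model_tdx_alloc_page()`) are hypothetical simulations rather than the kernel helpers, and "installing backing" is only a counter.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M (1UL << 21)
#define NR_2M_RANGES 16 /* hypothetical size of the simulated physical space */

/* Per-2M-range count of 4K pages that need 4K PAMT entries. */
static int pamt_refcount[NR_2M_RANGES];
/* Number of 2M ranges whose (simulated) DPAMT backing is installed. */
static int dpamt_ranges_backed;

static unsigned long range_idx(uint64_t pa)
{
	unsigned long idx = pa / SZ_2M;

	assert(idx < NR_2M_RANGES);
	return idx;
}

/* Model of a PAMT "get": the first 4K user of a 2M range installs backing. */
static void model_pamt_get(uint64_t pa)
{
	if (pamt_refcount[range_idx(pa)]++ == 0)
		dpamt_ranges_backed++;
}

/* Model of a PAMT "put": the last 4K user of a 2M range releases backing. */
static void model_pamt_put(uint64_t pa)
{
	unsigned long idx = range_idx(pa);

	assert(pamt_refcount[idx] > 0);
	if (--pamt_refcount[idx] == 0)
		dpamt_ranges_backed--;
}

/* Control pages and S-EPT page table pages are both 4K, so both take the
 * same path: "allocate" the page at @pa, then take a PAMT reference. */
static uint64_t model_tdx_alloc_page(uint64_t pa)
{
	model_pamt_get(pa);
	return pa;
}
```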

[*] AFAICT this is actually an optimization, not a requirement for
supporting hugepages with DPAMT.  Theoretically, we could always allocate
DPAMT pages upfront for a hugepage at allocation time in guest_memfd,
regardless of whether it is actually mapped as a hugepage in S-EPT, and
then free the DPAMT pages when we promote 4K pages to a hugepage in S-EPT,
effectively reducing the DPAMT pages used for hugepages.
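A sketch of the footnote's alternative, assuming a guest_memfd folio covers exactly one 2M range and that its DPAMT backing can be tracked as a single flag. `struct gmem_huge_folio`, `gmem_folio_alloc()` and `sept_promote_to_2m()` are hypothetical. Demotion back to 4K, which would need to reinstall the backing, is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated count of installed DPAMT backings (one per 2M folio). */
static int dpamt_backings;

/* Hypothetical model of a 2M guest_memfd folio and its S-EPT mapping state. */
struct gmem_huge_folio {
	bool dpamt_installed;
	bool mapped_2m; /* true once promoted to a 2M S-EPT mapping */
};

/* At guest_memfd allocation time: install DPAMT backing unconditionally,
 * so a later 4K mapping of this folio needs no allocation at fault time. */
static void gmem_folio_alloc(struct gmem_huge_folio *f)
{
	f->dpamt_installed = true;
	f->mapped_2m = false;
	dpamt_backings++;
}

/* Promotion of 4K mappings to one 2M mapping: the range no longer needs
 * 4K PAMT entries, so the DPAMT backing can be released (only once). */
static void sept_promote_to_2m(struct gmem_huge_folio *f)
{
	f->mapped_2m = true;
	if (f->dpamt_installed) {
		f->dpamt_installed = false;
		dpamt_backings--;
	}
}
```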

> Or some way to unwind a wrong guess of a small page. At that
> point I don't think it's simpler.
> 
> > 
> > I don't think we need to keep all "DPAMT pages" in the pool, right?
> 
> Not sure what you mean by this.

I mean we don't need to keep DPAMT pages in the list of 'struct
tdx_prealloc'.  tdx_pamt_put() gets the DPAMT pages back from the TDX
module and just frees them.
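A minimal sketch of that put path. It assumes a per-2M-range structure holding the DPAMT pages, and uses a mock in place of the TDX module call that hands them back. `NR_DPAMT_PAGES`, `struct pamt_range`, `mock_pamt_remove()` and the `sketch_pamt_*()` helpers are hypothetical. What it illustrates is that the returned pages go straight to `free()` rather than back onto a preallocation list.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SZ_4K 4096
/* Hypothetical: DPAMT pages per 2M range; the real value comes from the TDX module. */
#define NR_DPAMT_PAGES 2

static int live_pages; /* pages currently allocated by this sketch */

/* Hypothetical per-2M-range state: refcount plus the DPAMT pages in use. */
struct pamt_range {
	int refcount;
	void *dpamt[NR_DPAMT_PAGES];
};

static void *page_alloc(void)
{
	void *p = malloc(SZ_4K);

	if (p)
		live_pages++;
	return p;
}

static void page_free(void *p)
{
	free(p);
	live_pages--;
}

/* Mock of the TDX module handing the DPAMT pages back on removal. */
static void mock_pamt_remove(struct pamt_range *r, void *out[NR_DPAMT_PAGES])
{
	for (int i = 0; i < NR_DPAMT_PAGES; i++) {
		out[i] = r->dpamt[i];
		r->dpamt[i] = NULL;
	}
}

/* First reference on the range allocates and installs the DPAMT pages. */
static int sketch_pamt_get(struct pamt_range *r)
{
	if (r->refcount++)
		return 0;
	for (int i = 0; i < NR_DPAMT_PAGES; i++) {
		r->dpamt[i] = page_alloc();
		if (!r->dpamt[i]) {
			while (i--) { /* unwind the pages allocated so far */
				page_free(r->dpamt[i]);
				r->dpamt[i] = NULL;
			}
			r->refcount--;
			return -ENOMEM;
		}
	}
	return 0;
}

/* Last reference takes the pages back and frees them directly;
 * nothing is pushed onto a preallocation list. */
static void sketch_pamt_put(struct pamt_range *r)
{
	void *pages[NR_DPAMT_PAGES];

	assert(r->refcount > 0);
	if (--r->refcount)
		return;
	mock_pamt_remove(r, pages);
	for (int i = 0; i < NR_DPAMT_PAGES; i++)
		page_free(pages[i]);
}
```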

> 
> > 
> > If tdx_alloc_page() succeeds, then the "DPAMT pages" are also
> > "topup"ed, and the PAMT entries for the 2M range of the S-EPT page
> > are ready too.
> > 
> > This at least avoids having to export tdx_dpamt_entry_pages(), which
> > is not nice IMHO.  And I think it should yield simpler code.
> 
> I mean fewer exports are better, but I don't follow what is so egregious.
> It's not called from core KVM code.
> 
> > 
> > One more thinking:
> > 
> > I also have been thinking whether we can continue to use the KVM
> > standard 'struct kvm_mmu_memory_cache' for S-EPT pages.  Below is one
> > more idea for your reference.
> 
> The point of the new struct was to hand it to the arch/x86 side of the
> house. If we don't need to do that, then yes we could have options. And
> Dave suggested another struct that could be used to hand off the cache.
> 
> > 
> > In the previous discussion I think we concluded that 'kmem_cache'
> > doesn't work nicely with DPAMT (because the ctor() cannot fail, etc.).
> > And when we don't use 'kmem_cache', KVM just calls __get_free_page()
> > to topup objects.
> > But we need tdx_alloc_page() for allocation here, so this is the
> > problem.
> > 
> > If we add two callbacks for object allocation/free to 'struct
> > kvm_mmu_memory_cache', then we can have place to hook
> > tdx_alloc_page().
> 
> kvm_mmu_memory_cache has a lot of options at this point. All we really
> need is a list. I'm not sure it makes sense to keep cramming things
> into it?
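For comparison, a list-only pool in the spirit of 'struct tdx_prealloc' can be very small. This is a userspace sketch under assumed names (`struct page_list_pool`, `pool_topup()`, `pool_take()`, `pool_drain()`) that chains free pages through their own first bytes; it is not the structure from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SZ_4K 4096

/* Free pages are chained through their first bytes (hypothetical layout). */
struct free_page {
	struct free_page *next;
};

/* Hypothetical list-only preallocation pool: no capacity, no kmem_cache,
 * no gfp flags, just a list head and a count. */
struct page_list_pool {
	struct free_page *head;
	int cnt;
};

/* Allocate pages until the pool holds at least @min of them. */
static int pool_topup(struct page_list_pool *p, int min)
{
	while (p->cnt < min) {
		struct free_page *pg = malloc(SZ_4K);

		if (!pg)
			return -ENOMEM;
		pg->next = p->head;
		p->head = pg;
		p->cnt++;
	}
	return 0;
}

/* Pop one page, or return NULL if the pool is empty. */
static void *pool_take(struct page_list_pool *p)
{
	struct free_page *pg = p->head;

	if (!pg)
		return NULL;
	p->head = pg->next;
	p->cnt--;
	return pg;
}

/* Free every page still sitting in the pool. */
static void pool_drain(struct page_list_pool *p)
{
	void *pg;

	while ((pg = pool_take(p)))
		free(pg);
}
```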

It comes down to whether we want to continue to reuse
'kvm_mmu_memory_cache' (which is already implemented in KVM), or we want
to use a different infrastructure for tracking S-EPT pages.
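The callback idea can be sketched as follows, assuming a reduced, hypothetical copy of 'struct kvm_mmu_memory_cache'. The `obj_alloc`/`obj_free` fields do not exist in the real structure, `malloc()` stands in for __get_free_page(), and `tdx_hook_alloc()`/`tdx_hook_free()` are placeholders for hooks that would wrap tdx_alloc_page() and its matching free helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SZ_4K 4096
#define CACHE_CAPACITY 8

/* Reduced, hypothetical version of struct kvm_mmu_memory_cache with two
 * optional callbacks; the real structure has no such fields. */
struct memory_cache {
	void *(*obj_alloc)(void);    /* NULL: use the default allocator */
	void (*obj_free)(void *obj); /* NULL: use the default free */
	int nobjs;
	void *objects[CACHE_CAPACITY];
};

/* Userspace stand-ins for the default __get_free_page()/free_page() path. */
static void *default_alloc(void) { return malloc(SZ_4K); }
static void default_free(void *obj) { free(obj); }

static int dpamt_installs; /* counts pages handed out by the TDX hook */

/* Placeholders for hooks that would wrap the TDX allocation helpers. */
static void *tdx_hook_alloc(void)
{
	void *p = malloc(SZ_4K);

	if (p)
		dpamt_installs++;
	return p;
}

static void tdx_hook_free(void *p)
{
	free(p);
	dpamt_installs--;
}

/* Topup uses the callback when one is set, otherwise the default path. */
static int cache_topup(struct memory_cache *mc, int min)
{
	void *(*alloc)(void) = mc->obj_alloc ? mc->obj_alloc : default_alloc;

	assert(min <= CACHE_CAPACITY);
	while (mc->nobjs < min) {
		void *obj = alloc();

		if (!obj)
			return -ENOMEM;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

/* Release every cached object through the matching free path. */
static void cache_free(struct memory_cache *mc)
{
	void (*release)(void *) = mc->obj_free ? mc->obj_free : default_free;

	while (mc->nobjs)
		release(mc->objects[--mc->nobjs]);
}
```

Caches that leave both callbacks NULL keep the existing behavior, so only the S-EPT cache would opt in.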

Anyway, just my 2 cents for your reference.