Message-ID: <aGUW5PlofbcNJ7s1@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Jul 2025 19:24:20 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, "Edgecombe, Rick P"
<rick.p.edgecombe@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, "Shutemov,
Kirill" <kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"tabba@...gle.com" <tabba@...gle.com>, "Du, Fan" <fan.du@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
"Peng, Chao P" <chao.p.peng@...el.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Weiny, Ira"
<ira.weiny@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>, "Li,
Zhiquan1" <zhiquan1.li@...el.com>, "jroedel@...e.de" <jroedel@...e.de>,
"Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Tue, Jul 01, 2025 at 03:09:01PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@...el.com> writes:
>
> > On Mon, Jun 30, 2025 at 10:22:26PM -0700, Vishal Annapurve wrote:
> >> On Mon, Jun 30, 2025 at 10:04 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >> >
> >> > On Tue, Jul 01, 2025 at 05:45:54AM +0800, Edgecombe, Rick P wrote:
> >> > > On Mon, 2025-06-30 at 12:25 -0700, Ackerley Tng wrote:
> >> > > > > So for this we can do something similar. Have the arch/x86 side of TDX grow
> >> > > > > a new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> >> > > > > SEAM mode, wbinvd, and set a "no more seamcalls" bool. Then any SEAMCALLs
> >> > > > > after that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in
> >> > > > > the system die. Zap/cleanup paths return success in the buggy shutdown case.
> >> > > > >
> >> > > >
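To make the quoted idea concrete, a rough sketch (every name here is
hypothetical, not from any posted patch):

	#include <linux/atomic.h>
	#include <linux/smp.h>			/* on_each_cpu() */
	#include <asm/special_insns.h>		/* wbinvd() */

	static atomic_t tdx_no_more_seamcalls = ATOMIC_INIT(0);

	static void tdx_shutdown_cpu(void *unused)
	{
		/*
		 * The IPI itself kicks this CPU out of SEAM mode; flush
		 * caches before any still-private page can be reused.
		 */
		wbinvd();
	}

	void tdx_buggy_shutdown(void)
	{
		atomic_set(&tdx_no_more_seamcalls, 1);
		on_each_cpu(tdx_shutdown_cpu, NULL, 1);
	}

	/* ... and in the common SEAMCALL wrapper: */
	if (atomic_read(&tdx_no_more_seamcalls))
		return TDX_BUGGY_SHUTDOWN;
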
> >> > > > Do you mean that on unmap/split failure:
> >> > >
> >> > > Maybe Yan can clarify here. I thought the HWpoison scenario was about TDX module
> >> > My thinking is to set HWPoison on private pages whenever KVM_BUG_ON() is hit
> >> > in TDX, i.e., when the page is still mapped in the S-EPT but the TD is bugged
> >> > and about to be torn down.
> >> >
> >> > So, it could be due to KVM or TDX module bugs, which retries can't fix.
> >> >
> >> > > bugs. Not TDX busy errors, demote failures, etc. If there are "normal" failures,
> >> > > like the ones that can be fixed with retries, then I think HWPoison is not a
> >> > > good option though.
> >> > >
> >> > > > there is a way to make 100%
> >> > > > sure all memory becomes re-usable by the rest of the host, using
> >> > > > tdx_buggy_shutdown(), wbinvd, etc?
> >> >
> >> > Not sure about this approach. When the TDX module is buggy and the page is
> >> > still accessible to the guest as a private page, even with the no-more-SEAMCALLs
> >> > flag, is it safe enough for guest_memfd/hugetlb to re-assign the page, allowing
> >> > simultaneous access as shared memory alongside potential private access from
> >> > the TD or TDX module?
> >>
> >> If no more SEAMCALLs are allowed and all CPUs are made to exit SEAM
> >> mode, then how can there be potential private access from the TD or TDX
> >> module?
> > Not sure. As Kirill said, "TDX module has creative ways to corrupt it":
> > https://lore.kernel.org/all/zlxgzuoqwrbuf54wfqycnuxzxz2yduqtsjinr5uq4ss7iuk2rt@qaaolzwsy6ki/.
> >
> > Or, could TDX just set a page flag, like what is done for Xen:
> >
> > /* XEN */
> > /* Pinned in Xen as a read-only pagetable page. */
> > PG_pinned = PG_owner_priv_1,
> >
> > e.g.
> > PG_tdx_firmware_access = PG_owner_priv_1,
> >
> > Then, guest_memfd checks this flag on every zap and replaces it with
> > PG_hwpoison on behalf of TDX?
>
> I think this question probably arose because of a misunderstanding I
> might have caused. I meant to set the HWpoison flag from the kernel, not
> from within the TDX module. Please see [1].
Understood.
But as Rick pointed out in
https://lore.kernel.org/all/04d3e455d07042a0ab8e244e6462d9011c914581.camel@intel.com/,
manually setting the poison flag in KVM's TDX code (in the host kernel)
seems risky.
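To be concrete, the risky variant would be roughly (sketch only; the
surrounding S-EPT zap path is elided):

	/* flag-only, skipping memory_failure()'s accounting: */
	if (KVM_BUG_ON(err, kvm))
		SetPageHWPoison(page);

as opposed to routing the pfn through the full machinery:

	memory_failure(page_to_pfn(page), 0);
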
> In addition, if the TDX module (now referring specifically to the TDX
> module and not the kernel) sets page flags, that won't work with
Marking at the per-folio level seems acceptable to me.
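e.g. something like the below in the guest_memfd zap path, where
PG_tdx_firmware_access, its folio helpers, and the hook itself are all
hypothetical:

	static void kvm_gmem_handle_zap(struct folio *folio)
	{
		/* hypothetical flag, set by KVM's TDX code on KVM_BUG_ON() */
		if (folio_test_tdx_firmware_access(folio)) {
			/*
			 * Keep the folio away from the allocator
			 * (folio_set_hwpoison() needs CONFIG_MEMORY_FAILURE).
			 */
			folio_set_hwpoison(folio);
			folio_clear_tdx_firmware_access(folio);
		}
	}
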
> vmemmap-optimized folios. Setting a page flag on a vmemmap-optimized
> folio will be setting the flag on a few pages.
BTW, I have a concern regarding the overhead of vmemmap optimization.
On my system,
with hugetlb_free_vmemmap=false, the TD boot time is around 30s;
with hugetlb_free_vmemmap=true, the TD boot time is around 1m20s.
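(hugetlb_free_vmemmap= here is the early boot parameter toggled on the
kernel command line, e.g.:

	... hugetlb_free_vmemmap=true ...

assuming CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is enabled.)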
> [1] https://lore.kernel.org/all/diqzplej4llh.fsf@ackerleytng-ctop.c.googlers.com/