linux-kernel - Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime

Message-ID: <31e17bc8-2e9e-4e93-a912-3d54826e59d0@intel.com>
Date: Thu, 17 Apr 2025 11:56:11 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "peterz@...radead.org" <peterz@...radead.org>,
 "mingo@...hat.com" <mingo@...hat.com>, "Huang, Kai" <kai.huang@...el.com>,
 "bp@...en8.de" <bp@...en8.de>
Cc: "ashish.kalra@....com" <ashish.kalra@....com>,
 "seanjc@...gle.com" <seanjc@...gle.com>, "x86@...nel.org" <x86@...nel.org>,
 "sagis@...gle.com" <sagis@...gle.com>, "hpa@...or.com" <hpa@...or.com>,
 "Chatre, Reinette" <reinette.chatre@...el.com>,
 "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
 "Williams, Dan J" <dan.j.williams@...el.com>,
 "pbonzini@...hat.com" <pbonzini@...hat.com>,
 "thomas.lendacky@....com" <thomas.lendacky@....com>,
 "Yamahata, Isaku" <isaku.yamahata@...el.com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at
 runtime

On 4/17/25 11:21, Edgecombe, Rick P wrote:
> On Thu, 2025-04-17 at 10:50 -0700, Dave Hansen wrote:
>> On 4/16/25 16:02, Kai Huang wrote:
>>> Full support for kexec on a TDX host would require complex work.
>>> The cache flushing required would need to happen while stopping
>>> remote CPUs, which would require changes to a fragile area of the
>>> kernel.
>>
>> Doesn't kexec already stop remote CPUs? Doesn't this boil down to a
>> WBINVD? How is that complex?
> 
> When SME added an SME-only WBINVD in stop_this_cpu() it caused a shutdown hang
> on some particular HW. It turns out there was an existing race that was made
> worse by the slower operation. It went through some attempts to fix it, and
> finally tglx patched it up with:
> 
>   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
> 
> But in that patch he said the fix "cannot plug all holes either". So while
> looking at doing the WBINVD for TDX kexec, I was advocating for giving this a
> harder look before building on top of it. The patches to add TDX kexec support
> made the WBINVD happen on all bare metal, not just TDX HW. So whatever races
> exist would be exposed to a much wider variety of HW than SME tested out.

I get it. Adding WBINVD to this same path caused some pain before. But
just turning off the feature that calls this path seems like overkill.

How about we try to push WBINVD out of this path? It should be quite
doable for TDX, I think.

Let's say we had a percpu bool. It gets set on each CPU when SME is
enabled on the system. It also gets set when TDX is enabled. The kexec
code becomes:

-	if (SME)
+	if (per_cpu(newbool))
		wbinvd();

No TDX, no new wbinvd(). If SME, no change.

Now, here's where it gets fun. The bool can get _cleared_ after WBINVD
is executed on a CPU, at least on TDX systems. It then also needs to get
set after TDX might dirty a cacheline.

	TDCALL(); // dirties stuff
	per_cpu(newbool) = 1;

Then you can also do this on_each_cpu():

	wbinvd();
	per_cpu(newbool) = 0;

hopefully at a point after you're sure no more TDCALLs are being made. If
you screw it up, no biggie: the kexec-time WBINVD will make up for it, at
the cost of exposing TDX systems to the kexec timing bugs. But if the
on_each_cpu() thing works in the common case, you get no additional bug
exposure.
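
To make the flag lifecycle concrete, here is a rough userspace sketch of
the scheme above. It is only a simulation under stated assumptions: a
plain array stands in for the percpu bool, a counter stands in for
WBINVD, and every name in it (cache_maybe_dirty, wbinvd_sim(),
tdx_call_sim(), early_flush_sim(), kexec_stop_cpu_sim(), NR_CPUS_SIM) is
made up for illustration and is not existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SIM 4	/* hypothetical CPU count for the simulation */

/* Stand-in for the percpu bool ("newbool" above). */
static bool cache_maybe_dirty[NR_CPUS_SIM];

/* Counts simulated WBINVDs per CPU so the effect is observable. */
static int wbinvd_count[NR_CPUS_SIM];

/* Stand-in for wbinvd(); a real one would write back and invalidate caches. */
static void wbinvd_sim(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SIM);
	wbinvd_count[cpu]++;
}

/* SME or TDX gets enabled: every CPU may now hold dirty cachelines. */
static void enable_feature_sim(void)
{
	for (int cpu = 0; cpu < NR_CPUS_SIM; cpu++)
		cache_maybe_dirty[cpu] = true;
}

/* A call into TDX that might dirty a cacheline on this CPU. */
static void tdx_call_sim(int cpu)
{
	/* ... the call itself would go here ... */
	cache_maybe_dirty[cpu] = true;
}

/* Early flush, run on each CPU once no more TDX calls are expected. */
static void early_flush_sim(int cpu)
{
	wbinvd_sim(cpu);
	cache_maybe_dirty[cpu] = false;
}

/*
 * kexec-time path: flush only if the flag is still set.
 * Returns true if this CPU had to do the (slow) flush here.
 */
static bool kexec_stop_cpu_sim(int cpu)
{
	if (!cache_maybe_dirty[cpu])
		return false;
	wbinvd_sim(cpu);
	return true;
}
```

In this model, a CPU that went through early_flush_sim() and made no
further TDX calls skips the flush at kexec time, while a CPU that was
missed, or that made another call afterwards, still gets flushed in the
kexec path. The sketch ignores ordering and concurrency entirely, which
is exactly the part that needs care in the real thing.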

>>> It would also require resetting TDX private pages, which is non-
>>> trivial since the core kernel does not track them.
>>
>> Why? The next kernel will just use KeyID-0 which will blast the old
>> pages away with no side effects... right?
> 
> I believe this is talking about support to work around the #MC errata. Another
> version of kexec TDX support used a KVM callback to have it reset all the TDX
> guest memory it knows about.

So, let's just not support hardware with that erratum upstream.