linux-kernel - Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at runtime

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <13bc2339182107848b72976f970c824f4902240d.camel@intel.com>
Date: Tue, 22 Apr 2025 04:10:11 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>, "peterz@...radead.org"
	<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"bp@...en8.de" <bp@...en8.de>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "x86@...nel.org"
	<x86@...nel.org>, "sagis@...gle.com" <sagis@...gle.com>, "hpa@...or.com"
	<hpa@...or.com>, "Chatre, Reinette" <reinette.chatre@...el.com>,
	"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
	"Williams, Dan J" <dan.j.williams@...el.com>, "ashish.kalra@....com"
	<ashish.kalra@....com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "nik.borisov@...e.com" <nik.borisov@...e.com>
Subject: Re: [PATCH] x86/virt/tdx: Make TDX and kexec mutually exclusive at
 runtime

On Thu, 2025-04-17 at 11:56 -0700, Dave Hansen wrote:
> On 4/17/25 11:21, Edgecombe, Rick P wrote:
> > On Thu, 2025-04-17 at 10:50 -0700, Dave Hansen wrote:
> > > On 4/16/25 16:02, Kai Huang wrote:
> > > > Full support for kexec on a TDX host would require complex work.
> > > > The cache flushing required would need to happen while stopping
> > > > remote CPUs, which would require changes to a fragile area of the
> > > > kernel.
> > > 
> > > Doesn't kexec already stop remote CPUs? Doesn't this boil down to a
> > > WBINVD? How is that complex?
> > 
> > When SME added an SME-only WBINVD in stop_this_cpu() it caused a shutdown hang
> > on some particular HW. It turns out there was an existing race that was made
> > worse by the slower operation. It went through some attempts to fix it, and
> > finally tglx patched it up with:
> > 
> >   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
> > 
> > But in that patch he said the fix "cannot plug all holes either". So while
> > looking at doing the WBINVD for TDX kexec, I was advocating for giving this a
> > harder look before building on top of it. The patches to add TDX kexec support
> > made the WBINVD happen on all bare metal, not just TDX HW. So whatever races
> > exist would be exposed to a much wider variety of HW than SME tested out.
> 
> I get it. Adding WBINVD to this same path caused some pain before. But
> just turning off the feature that calls this path seems like overkill.
> 
> How about we try to push WBINVD out of this path? It should be quite
> doable for TDX, I think.

Thanks for providing feedback.

> 
> Let's say we had a percpu bool. It get set when SME is enabled on the
> system on each CPU. 
> 

This fits SME, but there's one thing I would like to point out:

For SME there are two phases involving wbinvd():

1) Do wbinvd() in stop_this_cpu() which is for all remote CPUs;
2) Do wbinvd() in relocate_kernel() which is for the last kexec-ing CPU;

And the thing is, currently, the conditions to decide whether to perform
wbivnd() for the above two cases are different:

Case 1) checks whether the _hardware_ supports SME (i.e., via CPUID);
Case 2) checks whether the _kernel_ has ever enabled SME.

We can just expand the case 2) to match case 1).  I don't see any issue here. 
Expanding to doing wbinvd() for bare-metal machine does the same thing anyway.

So I think this approach fits SME easily.  We can just set the per-cpu(newbool)
in early_detect_mem_encrypt() by always checking whether the _hardware_ supports
SME.

> It also gets enabled when TDX is enabled. 
> 

Reading below, I think you mean we only enable the newbool when TDX is enabled
*at runtime*?


> The kexec
> code becomes:
> 
> -	if (SME)
> +	if (per_cpu(newbool))
> 		wbinvd();
> 
> No TDX, no new wbinvd(). If SME, no change.

Yes.

> 
> Now, here's where it gets fun. The bool can get _cleared_ after WBINVD
> is executed on a CPU, at least on TDX systems. 
> 

Right this only works for TDX.  For SME, wbinvd() is always needed, since any
single write to any memory would generate a new dirty cacheline which may have
different encryption property when the second kernel uses that memory.

> It then also needs to get
> set after TDX might dirty a cacheline.
> 
> 	TDCALL(); // dirties stuff
> 	per_cpu(newbool) = 1;
> 
> Then you can also do this on_each_cpu():
> 
> 	wbinvd();
> 	per_cpu(newbool) = 0;
> 
> hopefully at point after you're sure no more TDCALLs are being made. 
> 

(Assume you meant SEAMCALL rather than TDCALL since this context is for TDX
host.)

In the initial TDX support KVM is the only user, so KVM knows when no more
SEAMCALLs can be made.  E.g., we can do this when the last VM (or TD) is
destroyed, or when KVM module is unloaded.

> If
> you screw it up, no biggie: the kexec-time one will make up for it,
> exposing TDX systems to the kexec timing bugs. But if the on_each_cpu()
> thing works in the common case, you get no additional bug exposure.

This is true.  But I don't think it's a bug, because kexec can happen at any
time, e.g., when there's TD still running.

So unless KVM is notified when kexec happens (e.g., via a reboot notifier) and
terminates all TDX guests when notified before the kexec-ing CPU tries to stop
all remote CPUs, KVM cannot guarantee wbinvd() has been done and there will be
no more SEAMCALLs.

So in short:

I think this approach will work for both SME and TDX.  But I am not sure whether
we should/need to make KVM do

	wbinvd();
	per_cpu(newbool) = 0;

unless we also want to make KVM get notified when kexec happens.

Another two concerns are: 1) the above code doesn't work for SME as mentioned
above; 2) with TDX Connect other kernel components may need to do similar thing
too.

Maybe we can start with below?

- setting per-cpu(newbool) for TDX when TDX gets enabled at runtime;
- letting kexec code to catch the wbinvd() for TDX

Or even simpler:

- setting per-cpu(newbool) for TDX when X86_FEATURE_TDX_HOST_PLATFORM is set at
early boot;
- letting kexec code to catch the wbinvd() for TDX


> 
> > > > It would also require resetting TDX private pages, which is non-
> > > > trivial since the core kernel does not track them.
> > > 
> > > Why? The next kernel will just use KeyID-0 which will blast the old
> > > pages away with no side effects... right?
> > 
> > I believe this is talking about support to work around the #MC errata. Another
> > version of kexec TDX support used a KVM callback to have it reset all the TDX
> > guest memory it knows about.
> 
> So, let's just not support hardware with that erratum upstream.

This can be done by refusing kexec at image loading time as shown in this patch:

https://lore.kernel.org/lkml/408103f145360dfa04a41bc836ca2c724f29deb0.1741778537.git.kai.huang@intel.com/

But if we want to make KVM get notified when kexec happens, then resetting TDX
private memory can also be done easily because KVM already has the code to tear
down a TDX guest including resetting TDX private memory and flushing cache
associated with the TDX guest etc.