linux-kernel - Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal in relocate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <40b14084af8a35af4e07fbd394821f92d0973d32.camel@intel.com>
Date: Fri, 14 Mar 2025 09:44:53 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>, "peterz@...radead.org"
	<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"bp@...en8.de" <bp@...en8.de>, "kirill.shutemov@...ux.intel.com"
	<kirill.shutemov@...ux.intel.com>
CC: "ashish.kalra@....com" <ashish.kalra@....com>, "dyoung@...hat.com"
	<dyoung@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "dwmw@...zon.co.uk"
	<dwmw@...zon.co.uk>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"bhe@...hat.com" <bhe@...hat.com>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
	"Chatre, Reinette" <reinette.chatre@...el.com>, "hpa@...or.com"
	<hpa@...or.com>, "sagis@...gle.com" <sagis@...gle.com>,
	"david.kaplan@....com" <david.kaplan@....com>, "x86@...nel.org"
	<x86@...nel.org>, "Williams, Dan J" <dan.j.williams@...el.com>
Subject: Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal
 in relocate_kernel()

On Thu, 2025-03-13 at 23:17 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > For both SME and TDX, dirty cachelines with and without the encryption
> > bit(s) of the same physical memory address can coexist and the CPU can
> > flush them back to memory in random order.
> > 
> 
> A lot going on in this sentence, how about simplifying it:
> 
> For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
> the CPU can flush them back to memory in a random order.

"multiple" isn't accurate at least for SME.  How about:

 For SME and TDX, dirty cachelines with and without encryption bit(s) of 
 the same memory can coexist, and the CPU can flush them back to memory 
 in random order.

> 
> 
> >   During kexec, the caches
> > must be flushed before jumping to the new kernel to avoid silent memory
> > corruption to the new kernel.
> 
> During kexec, the caches must be flushed before jumping to the new kernel to
> avoid silent memory corruption when a cacheline with a different encryption
> property is written back over whatever encryption properties the new kernel is
> using.
> 
> ...it distributes some of the details from the first sentence into the second.
> Easier to read or no? I'm not sure.

I don't have opinion.  I see no difference.

I tends to keep the original words since people have reviewed.

> 
> > 
> > The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> > they are being stopped.  For SME, the WBINVD in relocate_kernel()
> > flushes the cache for the last running CPU (which is doing kexec).
> > 
> > Similarly, to support kexec for TDX host, after stopping all remote CPUs
> > with cache flushed, the kernel needs to flush cache for the last running
> > CPU.
> 
> 
> I mentioned this in a previous version. I think you need to give some hint to
> where you are going before you start listing facts. Like:
> 
> During kexec, WBINVD needs to be executed on each CPU for TDX and SME. 
> 
> On the
> remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
> kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
> for the kexec-ing CPU still to implement.
> 
> ...it first says the overall problem to solve, then explains what is missing in
> the current code to solve it. The reader is already thinking of what the
> solutions should be and...

Will do.

> 
> > 
> > Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
> 
> ...they read the solution just as they are wondering about it. The reader can
> feel like the change is aligned with their thinking.
> 
> > 
> > Just do unconditional
> > 
> 
> "Unconditional". Now I'm starting to think about how unconditional wbinvd will
> be.
> 
> >  WBINVD to cover both SME and TDX instead of
> > sprinkling additional vendor-specific checks.  Kexec is a slow path, and
> > the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> > 
> > But only do WBINVD for bare-metal
> > 
> 
> But wait, now I'm learning it's not unconditional. I need to re-think what I
> just evaluated. And I'm doubting the earlier statements because I just got
> surprised.

Do you mean you got surprised by saying "unconditional" first and then saying
"for bare-metal"?  This is literally what the patch title says.  I don't see any
problem, but I can mentioned the "for bare-metal" part when I say
"unconditional" above.

> 
> >  because TDX guests and SEV-ES/SEV-SNP
> > guests will get unexpected (and yet unnecessary) exception (#VE or #VC)
> > which the kernel is unable to handle at the time of relocate_kernel()
> > since the kernel has torn down the IDT.
> > 
> > Remove the host_mem_enc_active local variable and directly use
> > !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument of calling
> > relocate_kernel().
> > 
> 
> Start with the problem here. It just describes another change and I'm not sure
> why when I start reading it.
> 
> By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
> so it needs to be changed.
> 
> >   cpu_feature_enabled() is always inline but not a
> 
> I was just noticing this on the other patch. Actually it could call into some
> kasan stuff.

Can you be more specific since I am not seeing it.

> 
> > function call, thus it is safe to use after load_segments() when call
> > depth tracking is enabled.
> 
> This function call tracking stuff is a wild card at the end. What about
> describing the rules this function needs to follow due to call depth tracking,
> and explain why the change does that.

Below is what I had before.  Do you think it's better?  I replaced them with the
current one since Reinette commented the original one (which contains history
etc) was not necessary.

"
Commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
moved calling 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' as an argument
of relocate_kernel() to an earlier place before load_segments() by
adding a variable 'host_mem_enc_active'.  The reason was the call to
cc_platform_has() after load_segments() caused a fault and system crash
when call depth tracking is active because load_segments() resets GS to
0 but call depth tracking uses per-CPU variable to operate.

Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
kernel runs on bare-metal.  cpu_feature_enabled() is always inline but
not a function call, thus it is safe to use it after load_segments()
when call depth tracking is enabled.  Remove the 'host_mem_enc_active'
variable and use cpu_feature_enabled() directly as the argument when
calling relocate_kernel().
"


[...]

> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> >  {
> >  	unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> >  	relocate_kernel_fn *relocate_kernel_ptr;
> > -	unsigned int host_mem_enc_active;
> >  	int save_ftrace_enabled;
> >  	void *control_page;
> >  
> > -	/*
> > -	 * This must be done before load_segments() since if call depth tracking
> > -	 * is used then GS must be valid to make any function calls.
> > -	 */
> > -	host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > -
> >  #ifdef CONFIG_KEXEC_JUMP
> >  	if (image->preserve_context)
> >  		save_processor_state();
> > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> >  	 *
> >  	 * I take advantage of this here by force loading the
> >  	 * segments, before I zap the gdt with an invalid value.
> > +	 *
> > +	 * load_segments() resets GS to 0.  Don't make any function call
> > +	 * after here since call depth tracking uses per-CPU variables to
> > +	 * operate (relocate_kernel() is explicitly ignored by call depth
> > +	 * tracking).
> 
> I think I suggested you should call out the opportunistic change here in the
> log. Did you disagree?

I replied this was suggested by David Kaplan, but I guess I forgot to reply the
"opportunistic" part.

I don't think this is opportunistic change.  It's a valid comment after the 
'host_mem_enc_active' variable and the comment around it were removed.