Message-ID: <e0d9a3d599025c92fce5e159e8acc1af32844912.camel@intel.com>
Date: Thu, 13 Mar 2025 18:40:09 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>, "peterz@...radead.org"
<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
<dave.hansen@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "bp@...en8.de"
<bp@...en8.de>, "kirill.shutemov@...ux.intel.com"
<kirill.shutemov@...ux.intel.com>
CC: "nik.borisov@...e.com" <nik.borisov@...e.com>, "bhe@...hat.com"
<bhe@...hat.com>, "seanjc@...gle.com" <seanjc@...gle.com>, "x86@...nel.org"
<x86@...nel.org>, "dyoung@...hat.com" <dyoung@...hat.com>, "sagis@...gle.com"
<sagis@...gle.com>, "hpa@...or.com" <hpa@...or.com>, "Chatre, Reinette"
<reinette.chatre@...el.com>, "Williams, Dan J" <dan.j.williams@...el.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "pbonzini@...hat.com"
<pbonzini@...hat.com>, "ashish.kalra@....com" <ashish.kalra@....com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "dwmw@...zon.co.uk" <dwmw@...zon.co.uk>
Subject: Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal
in stop_this_cpu()
On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> TL;DR:
>
> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> cover kexec support for both AMD SME and Intel TDX. Previously there
> _was_ some issue preventing from doing so but now it has been fixed.
^ Adding "the kernel" here would read a little
cleaner to me.
When I read "some issue" I start assuming it wasn't fully debugged and that it
is still "some issue" that no one has figured out. But below it sounds like it
was root caused.
> Long version:
It might make this long version easier to digest if it started with a "tell
them what you are going to tell them" paragraph.
>
> AMD SME uses the C-bit to determine whether to encrypt the memory or
> not. For the same physical memory address, dirty cachelines with and
> without the C-bit can coexist and the CPU can flush them back to memory
> in random order. To support kexec for SME, the old kernel uses WBINVD
> to flush cache before booting to the new kernel so that no stale dirty
> cacheline are left over by the old kernel which could otherwise corrupt
> the new kernel's memory.
>
> TDX uses 'KeyID' bits in the physical address for memory encryption and
> has the same requirement. To support kexec for TDX, the old kernel
> needs to flush cache of TDX private memory.
This paragraph is a little jarring because it's not clear how it follows from
the first paragraph. It helps the reader to give some hints on how they should
organize the information as they go along. If it's too much of an info dump, it
puts an extra burden on them: they have to hold all of the facts in their head
until they can put the bigger picture together themselves.
The extra info about TDX using KeyID also seems unnecessary for the point being
made (IIUC).
>
> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> is supported by hardware. Perform unconditional WBINVD to support TDX
> instead of adding one more vendor-specific check. Kexec is a slow path,
> and the additional WBINVD is acceptable for the sake of simplicity and
> maintainability.
Out of curiosity, do you know why this was not already needed for
non-self-snoop CPUs? Why can't there be other cache modes that get written back
after the new kernel starts using the memory for something else?
>
> Only do WBINVD on bare-metal. Doing WBINVD in guest triggers unexpected
^the
> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
> may not be able to handle such exception (e.g., TDX guests panics if it
^panic
> sees such #VE).
It's a small thing, but I think you could skip the #VE or #VC info here. All
they need to know to understand this patch is that TDX and some SEV kernels
cannot handle WBINVD. And TDX panics. (does SEV not?)
>
> History of SME and kexec WBINVD:
>
> There _was_ an issue preventing doing unconditional WBINVD but that has
> been fixed.
>
> Initial SME kexec support added an unconditional WBINVD in commit
>
> bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
>
> This commit caused different Intel systems to hang or reset.
>
> Without a clear root cause, a later commit
>
> f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
>
> fixed the Intel system hang issues by only doing WBINVD when hardware
> supports SME.
>
> A corner case [*] revealed the root cause of the system hang issues and
> was fixed by commit
>
> 1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
>
> See [1][2] for more information.
>
> Further testing of doing unconditional WBINVD based on the above fix on
> the problematic machines (that issues were originally reported)
> confirmed the issues couldn't be reproduced.
>
> See [3][4] for more information.
>
> Therefore, it is safe to do unconditional WBINVD for bare-metal now.
Instead of a play-by-play, it might be more informative to summarize the edges
covered in this history:
- Don't do anything that writes memory between wbinvd and native_halt(). This
includes function calls that touch the stack.
- Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies, it
is too expensive.
- Avoid racing with reboot by watching a cpumask instead of num_online_cpus().
But there is still a race.
Hmm, on the last one tglx says:
The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window
Are we opening up more platforms to a race by unconditionally doing the wbinvd?
Can we clarify that nothing bad happens if we lose the race? (and is it
true?)
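To make that concrete, here is a rough sketch of the cpumask-based wait tglx
describes (not the actual kernel code; the helper name and the exact IPI call
are made up for illustration):

    /*
     * CPUs remove themselves from stop_mask once they have stopped, so
     * the NMI fallback can be limited to CPUs that never reported in.
     */
    static void wait_for_cpus_to_stop(struct cpumask *stop_mask)
    {
            unsigned long timeout = USEC_PER_SEC;

            /* Wait for the REBOOT vector IPI to take effect. */
            while (!cpumask_empty(stop_mask) && timeout--)
                    udelay(1);

            /* NMI fallback only for the stragglers still in the mask. */
            if (!cpumask_empty(stop_mask))
                    apic->send_IPI_mask(stop_mask, NMI_VECTOR);
    }

Even with that, a CPU could still be in flight between clearing its mask bit
and halting, which I take to be the hole tglx mentions.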
>
> [*] The commit didn't check whether the CPUID leaf is available or not.
> Making unsupported CPUID leaf on Intel returns garbage resulting in
> unintended WBINVD which caused some issue (followed by the analysis and
> the reveal of the final root cause). The corner case was independently
> fixed by commit
>
> 9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
>
> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> Suggested-by: Borislav Petkov <bp@...en8.de>
> Cc: Tom Lendacky <thomas.lendacky@....com>
> Cc: Dave Young <dyoung@...hat.com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
> ---
> arch/x86/kernel/process.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 9c75d701011f..8475d9d2d8c4 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
> mcheck_cpu_clear(c);
>
> /*
> - * Use wbinvd on processors that support SME. This provides support
> - * for performing a successful kexec when going from SME inactive
> - * to SME active (or vice-versa). The cache must be cleared so that
> - * if there are entries with the same physical address, both with and
> - * without the encryption bit, they don't race each other when flushed
> - * and potentially end up with the wrong entry being committed to
> - * memory.
> + * Use wbinvd to support kexec for both SME (from inactive to active
> + * or vice-versa) and TDX. The cache must be cleared so that if there
> + * are entries with the same physical address, both with and without
> + * the encryption bit(s), they don't race each other when flushed and
> + * potentially end up with the wrong entry being committed to memory.
> *
> - * Test the CPUID bit directly because the machine might've cleared
> - * X86_FEATURE_SME due to cmdline options.
> + * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
> + * bare-metal to cover both SME and TDX. Do not do WBINVD in a guest
> + * since performing one will result in an exception (#VE or #VC) for
> + * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
> + * handle (e.g., TDX guest panics if it sees #VE).
> */
> - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
> + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
> wbinvd();
I see that this already has Tom's RB, but I'm not sure how this works for AMD.
The original SME patch tried to avoid writing to memory by putting the wbinvd
immediately before the halt, but today it is further away. Below this hunk there
are more instructions that could dirty memory before the halt. Ohh... it's new.
9 months ago, 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback")
added a function call that would touch the stack. I think it's wrong? And it
was probably introduced after this patch was originally written.
Then the cpuid_eax() could be non-inlined, but probably not. But the
boot_cpu_has() added in this patch could call out to kasan and dirty the stack.
So I think the existing SME case might be theoretically incorrect, and if so
this makes things very slightly worse.
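To illustrate the ordering I'd expect, a minimal sketch (the helper is
hypothetical, and I haven't checked what the compiler actually emits for the
real code):

    /*
     * Once the cache flush starts, nothing may dirty memory (no function
     * calls, no stack pushes from instrumentation) before the CPU halts,
     * or a dirty line can be written back after the wbinvd completes.
     */
    static __always_inline void __noreturn flush_cache_and_halt(void)
    {
            native_wbinvd();        /* flush all dirty cachelines */
            for (;;)
                    native_halt();  /* no memory writes in between */
    }

If boot_cpu_has() were evaluated before that point, nothing between the flush
and the halt would touch the stack.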
>
> /*