linux-kernel - Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e7adcb16-ff6c-d901-155b-866be4de2d40@amd.com>
Date: Fri, 14 Mar 2025 10:11:59 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>,
 "peterz@...radead.org" <peterz@...radead.org>,
 "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
 <dave.hansen@...el.com>, "Huang, Kai" <kai.huang@...el.com>,
 "bp@...en8.de" <bp@...en8.de>,
 "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>
Cc: "nik.borisov@...e.com" <nik.borisov@...e.com>,
 "bhe@...hat.com" <bhe@...hat.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
 "x86@...nel.org" <x86@...nel.org>, "dyoung@...hat.com" <dyoung@...hat.com>,
 "sagis@...gle.com" <sagis@...gle.com>, "hpa@...or.com" <hpa@...or.com>,
 "Chatre, Reinette" <reinette.chatre@...el.com>,
 "Williams, Dan J" <dan.j.williams@...el.com>,
 "pbonzini@...hat.com" <pbonzini@...hat.com>,
 "ashish.kalra@....com" <ashish.kalra@....com>,
 "Yamahata, Isaku" <isaku.yamahata@...el.com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "dwmw@...zon.co.uk" <dwmw@...zon.co.uk>
Subject: Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal
 in stop_this_cpu()

On 3/13/25 13:40, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
>> TL;DR:
>>
>> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
>> cover kexec support for both AMD SME and Intel TDX.  Previously there
>> _was_ some issue preventing from doing so but now it has been fixed.
>                              ^ Adding "the kernel" here would read a little
> cleaner to me.
> 
> 
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.
> 
>> Long version:
> 
> It might make this easier to digest this long version if it start with a "tell
> them what you are going to tell them" paragraph.
> 
>>
>> AMD SME uses the C-bit to determine whether to encrypt the memory or
>> not.  For the same physical memory address, dirty cachelines with and
>> without the C-bit can coexist and the CPU can flush them back to memory
>> in random order.  To support kexec for SME, the old kernel uses WBINVD
>> to flush cache before booting to the new kernel so that no stale dirty
>> cacheline are left over by the old kernel which could otherwise corrupt
>> the new kernel's memory.
>>
>> TDX uses 'KeyID' bits in the physical address for memory encryption and
>> has the same requirement.  To support kexec for TDX, the old kernel
>> needs to flush cache of TDX private memory.
> 
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
> 
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).
> 
>>
>> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
>> is supported by hardware.  Perform unconditional WBINVD to support TDX
>> instead of adding one more vendor-specific check.  Kexec is a slow path,
>> and the additional WBINVD is acceptable for the sake of simplicity and
>> maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?
> 
>>
>> Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
>                                                 ^the
>> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
>> may not be able to handle such exception (e.g., TDX guests panics if it
>                                                              ^panic
>> sees such #VE).
> 
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)
> 
>>
>> History of SME and kexec WBINVD:
>>
>> There _was_ an issue preventing doing unconditional WBINVD but that has
>> been fixed.
>>
>> Initial SME kexec support added an unconditional WBINVD in commit
>>
>>   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
>>
>> This commit caused different Intel systems to hang or reset.
>>
>> Without a clear root cause, a later commit
>>
>>   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
>>
>> fixed the Intel system hang issues by only doing WBINVD when hardware
>> supports SME.
>>
>> A corner case [*] revealed the root cause of the system hang issues and
>> was fixed by commit
>>
>>   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
>>
>> See [1][2] for more information.
>>
>> Further testing of doing unconditional WBINVD based on the above fix on
>> the problematic machines (that issues were originally reported)
>> confirmed the issues couldn't be reproduced.
>>
>> See [3][4] for more information.
>>
>> Therefore, it is safe to do unconditional WBINVD for bare-metal now.
> 
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:
>  - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.
>  - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.
>  - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
> 
> Hmm, on the last one tglx says:
>     The cpumask cannot plug all holes either, but it's better than a raw
>     counter and allows to restrict the NMI fallback IPI to be sent only the
>     CPUs which have not reported within the timeout window
> 
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we be clarify that nothing bad happens if we lose the race? (and is it
> true?)
> 
>>
>> [*] The commit didn't check whether the CPUID leaf is available or not.
>> Making unsupported CPUID leaf on Intel returns garbage resulting in
>> unintended WBINVD which caused some issue (followed by the analysis and
>> the reveal of the final root cause).  The corner case was independently
>> fixed by commit
>>
>>   9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
>>
>> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
>> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
>> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
>> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
>> Signed-off-by: Kai Huang <kai.huang@...el.com>
>> Suggested-by: Borislav Petkov <bp@...en8.de>
>> Cc: Tom Lendacky <thomas.lendacky@....com>
>> Cc: Dave Young <dyoung@...hat.com>
>> Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
>> ---
>>  arch/x86/kernel/process.c | 21 +++++++++++----------
>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 9c75d701011f..8475d9d2d8c4 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
>>  	mcheck_cpu_clear(c);
>>  
>>  	/*
>> -	 * Use wbinvd on processors that support SME. This provides support
>> -	 * for performing a successful kexec when going from SME inactive
>> -	 * to SME active (or vice-versa). The cache must be cleared so that
>> -	 * if there are entries with the same physical address, both with and
>> -	 * without the encryption bit, they don't race each other when flushed
>> -	 * and potentially end up with the wrong entry being committed to
>> -	 * memory.
>> +	 * Use wbinvd to support kexec for both SME (from inactive to active
>> +	 * or vice-versa) and TDX.  The cache must be cleared so that if there
>> +	 * are entries with the same physical address, both with and without
>> +	 * the encryption bit(s), they don't race each other when flushed and
>> +	 * potentially end up with the wrong entry being committed to memory.
>>  	 *
>> -	 * Test the CPUID bit directly because the machine might've cleared
>> -	 * X86_FEATURE_SME due to cmdline options.
>> +	 * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
>> +	 * bare-metal to cover both SME and TDX.  Do not do WBINVD in a guest
>> +	 * since performing one will result in an exception (#VE or #VC) for
>> +	 * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
>> +	 * handle (e.g., TDX guest panics if it sees #VE).
>>  	 */
>> -	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
>> +	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>>  		wbinvd();
> 
> I see that this already has Tom's RB, but I'm not sure how this works for AMD.
> The original SME patch tried to avoid writing to memory by putting the wbinvd
> immediately before the halt, but today it is further away. Below this hunk there
> are more instructions that could dirty memory before the halt.  Ohh... it's new.
> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
> a function call that would touch the stack. I think it's wrong? And probably
> introduced after this patch was originally written.
> 
> Then the cpuid_eax() could be non-inlined, but probably not. But the
> boot_cpu_has() added in this patch could call out to kasan and dirty the stack.
> 
> So I think the existing SME case might be theoretically incorrect, and if so
> this makes things very slightly worse.

But the wbinvd() is performed after those checks, so everything gets flushed.

Thanks,
Tom

> 
>>  
>>  	/*
>