[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e7adcb16-ff6c-d901-155b-866be4de2d40@amd.com>
Date: Fri, 14 Mar 2025 10:11:59 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
<dave.hansen@...el.com>, "Huang, Kai" <kai.huang@...el.com>,
"bp@...en8.de" <bp@...en8.de>,
"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>
Cc: "nik.borisov@...e.com" <nik.borisov@...e.com>,
"bhe@...hat.com" <bhe@...hat.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
"x86@...nel.org" <x86@...nel.org>, "dyoung@...hat.com" <dyoung@...hat.com>,
"sagis@...gle.com" <sagis@...gle.com>, "hpa@...or.com" <hpa@...or.com>,
"Chatre, Reinette" <reinette.chatre@...el.com>,
"Williams, Dan J" <dan.j.williams@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"ashish.kalra@....com" <ashish.kalra@....com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dwmw@...zon.co.uk" <dwmw@...zon.co.uk>
Subject: Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal
in stop_this_cpu()
On 3/13/25 13:40, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
>> TL;DR:
>>
>> Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
>> cover kexec support for both AMD SME and Intel TDX. Previously there
>> _was_ some issue preventing from doing so but now it has been fixed.
> ^ Adding "the kernel" here would read a little
> cleaner to me.
>
>
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.
>
>> Long version:
>
> It might make this easier to digest this long version if it start with a "tell
> them what you are going to tell them" paragraph.
>
>>
>> AMD SME uses the C-bit to determine whether to encrypt the memory or
>> not. For the same physical memory address, dirty cachelines with and
>> without the C-bit can coexist and the CPU can flush them back to memory
>> in random order. To support kexec for SME, the old kernel uses WBINVD
>> to flush cache before booting to the new kernel so that no stale dirty
>> cacheline are left over by the old kernel which could otherwise corrupt
>> the new kernel's memory.
>>
>> TDX uses 'KeyID' bits in the physical address for memory encryption and
>> has the same requirement. To support kexec for TDX, the old kernel
>> needs to flush cache of TDX private memory.
>
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
>
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).
>
>>
>> Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
>> is supported by hardware. Perform unconditional WBINVD to support TDX
>> instead of adding one more vendor-specific check. Kexec is a slow path,
>> and the additional WBINVD is acceptable for the sake of simplicity and
>> maintainability.
>
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?
>
>>
>> Only do WBINVD on bare-metal. Doing WBINVD in guest triggers unexpected
> ^the
>> exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
>> may not be able to handle such exception (e.g., TDX guests panics if it
> ^panic
>> sees such #VE).
>
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)
>
>>
>> History of SME and kexec WBINVD:
>>
>> There _was_ an issue preventing doing unconditional WBINVD but that has
>> been fixed.
>>
>> Initial SME kexec support added an unconditional WBINVD in commit
>>
>> bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
>>
>> This commit caused different Intel systems to hang or reset.
>>
>> Without a clear root cause, a later commit
>>
>> f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
>>
>> fixed the Intel system hang issues by only doing WBINVD when hardware
>> supports SME.
>>
>> A corner case [*] revealed the root cause of the system hang issues and
>> was fixed by commit
>>
>> 1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
>>
>> See [1][2] for more information.
>>
>> Further testing of doing unconditional WBINVD based on the above fix on
>> the problematic machines (that issues were originally reported)
>> confirmed the issues couldn't be reproduced.
>>
>> See [3][4] for more information.
>>
>> Therefore, it is safe to do unconditional WBINVD for bare-metal now.
>
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:
> - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.
> - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.
> - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
>
> Hmm, on the last one tglx says:
> The cpumask cannot plug all holes either, but it's better than a raw
> counter and allows to restrict the NMI fallback IPI to be sent only the
> CPUs which have not reported within the timeout window
>
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we be clarify that nothing bad happens if we lose the race? (and is it
> true?)
>
>>
>> [*] The commit didn't check whether the CPUID leaf is available or not.
>> Making unsupported CPUID leaf on Intel returns garbage resulting in
>> unintended WBINVD which caused some issue (followed by the analysis and
>> the reveal of the final root cause). The corner case was independently
>> fixed by commit
>>
>> 9b040453d444: ("x86/smp: Dont access non-existing CPUID leaf")
>>
>> Link: https://lore.kernel.org/lkml/28a494ca-3173-4072-921c-6c5f5b257e79@amd.com/ [1]
>> Link: https://lore.kernel.org/lkml/24844584-8031-4b58-ba5c-f85ef2f4c718@amd.com/ [2]
>> Link: https://lore.kernel.org/lkml/20240221092856.GAZdXCWGJL7c9KLewv@fat_crate.local/ [3]
>> Link: https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/ [4]
>> Signed-off-by: Kai Huang <kai.huang@...el.com>
>> Suggested-by: Borislav Petkov <bp@...en8.de>
>> Cc: Tom Lendacky <thomas.lendacky@....com>
>> Cc: Dave Young <dyoung@...hat.com>
>> Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
>> ---
>> arch/x86/kernel/process.c | 21 +++++++++++----------
>> 1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 9c75d701011f..8475d9d2d8c4 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -819,18 +819,19 @@ void __noreturn stop_this_cpu(void *dummy)
>> mcheck_cpu_clear(c);
>>
>> /*
>> - * Use wbinvd on processors that support SME. This provides support
>> - * for performing a successful kexec when going from SME inactive
>> - * to SME active (or vice-versa). The cache must be cleared so that
>> - * if there are entries with the same physical address, both with and
>> - * without the encryption bit, they don't race each other when flushed
>> - * and potentially end up with the wrong entry being committed to
>> - * memory.
>> + * Use wbinvd to support kexec for both SME (from inactive to active
>> + * or vice-versa) and TDX. The cache must be cleared so that if there
>> + * are entries with the same physical address, both with and without
>> + * the encryption bit(s), they don't race each other when flushed and
>> + * potentially end up with the wrong entry being committed to memory.
>> *
>> - * Test the CPUID bit directly because the machine might've cleared
>> - * X86_FEATURE_SME due to cmdline options.
>> + * stop_this_cpu() isn't a fast path, just do unconditional WBINVD for
>> + * bare-metal to cover both SME and TDX. Do not do WBINVD in a guest
>> + * since performing one will result in an exception (#VE or #VC) for
>> + * TDX or SEV-ES/SEV-SNP guests which the guest may not be able to
>> + * handle (e.g., TDX guest panics if it sees #VE).
>> */
>> - if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
>> + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
>> wbinvd();
>
> I see that this already has Tom's RB, but I'm not sure how this works for AMD.
> The original SME patch tried to avoid writing to memory by putting the wbinvd
> immediately before the halt, but today it is further away. Below this hunk there
> are more instructions that could dirty memory before the halt. Ohh... it's new.
> 9 months ago 26ba7353caaa ("x86/smp: Add smp_ops.stop_this_cpu() callback") adds
> a function call that would touch the stack. I think it's wrong? And probably
> introduced after this patch was originally written.
>
> Then the cpuid_eax() could be non-inlined, but probably not. But the
> boot_cpu_has() added in this patch could call out to kasan and dirty the stack.
>
> So I think the existing SME case might be theoretically incorrect, and if so
> this makes things very slightly worse.
But the wbinvd() is performed after those checks, so everything gets flushed.
Thanks,
Tom
>
>>
>> /*
>
Powered by blists - more mailing lists