linux-kernel - Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal in stop_this

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <324a342417a1455633d4646b4be199aa6d85509d.camel@intel.com>
Date: Mon, 17 Mar 2025 10:11:46 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>, "peterz@...radead.org"
	<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
	<dave.hansen@...el.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"bp@...en8.de" <bp@...en8.de>, "kirill.shutemov@...ux.intel.com"
	<kirill.shutemov@...ux.intel.com>
CC: "dwmw@...zon.co.uk" <dwmw@...zon.co.uk>, "dyoung@...hat.com"
	<dyoung@...hat.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "sagis@...gle.com" <sagis@...gle.com>,
	"hpa@...or.com" <hpa@...or.com>, "Chatre, Reinette"
	<reinette.chatre@...el.com>, "bhe@...hat.com" <bhe@...hat.com>, "Williams,
 Dan J" <dan.j.williams@...el.com>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
	"nik.borisov@...e.com" <nik.borisov@...e.com>, "ashish.kalra@....com"
	<ashish.kalra@....com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>
Subject: Re: [RFC PATCH 1/5] x86/kexec: Do unconditional WBINVD for bare-metal
 in stop_this_cpu()

On Thu, 2025-03-13 at 18:40 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > TL;DR:
> > 
> > Change to do unconditional WBINVD in stop_this_cpu() for bare metal to
> > cover kexec support for both AMD SME and Intel TDX.  Previously there
> > _was_ some issue preventing from doing so but now it has been fixed.
>                              ^ Adding "the kernel" here would read a little
> cleaner to me.

OK.

> 
> 
> When I read "some issue" I start assuming it wasn't fully debugged and it is
> "some issue" that no one knows. But below it sounds like it was root caused.

I am not sure what's wrong with "some issue".  I used "_was_" to convey this
issue was fixed (thus root caused).  Perhaps the "root caused" part isn't clear?
I can explicitly call it out.

  Previously there _was_ some issue preventing the kernel from doing so but 
  now it has been root caused and fixed. 

> 
> > Long version:
> 
> It might make this easier to digest this long version if it start with a "tell
> them what you are going to tell them" paragraph.
> 
> > 
> > AMD SME uses the C-bit to determine whether to encrypt the memory or
> > not.  For the same physical memory address, dirty cachelines with and
> > without the C-bit can coexist and the CPU can flush them back to memory
> > in random order.  To support kexec for SME, the old kernel uses WBINVD
> > to flush cache before booting to the new kernel so that no stale dirty
> > cacheline are left over by the old kernel which could otherwise corrupt
> > the new kernel's memory.
> > 
> > TDX uses 'KeyID' bits in the physical address for memory encryption and
> > has the same requirement.  To support kexec for TDX, the old kernel
> > needs to flush cache of TDX private memory.
> 
> This paragraph is a little jarring because it's not clear how it follows from
> the first paragraph. It helps the reader to give some hints on how they should
> organize the information as they go along. If it's too much of an info dump, it
> puts an extra burden. They have to try to hold all of the facts in their head
> until they can put together the bigger picture themselves.
> 
> The extra info about TDX using KeyID also seems unnecessary to the point (IIUC).

I added the above two paragraphs to mainly address Reinette's concern regarding
"cache can be in coherent status due to memory encryption" being too vague.

I also think they are too verbose.  How about placing the first two paragraphs
with what I have (after addressing your comments) in the patch 2:

 For SME and TDX, dirty cachelines with and without encryption bit(s) of 
 the same memory can coexist, and the CPU can flush them back to memory 
 in random order.

?

> 
> > 
> > Currently, the kernel only performs WBINVD in stop_this_cpu() when SME
> > is supported by hardware.  Perform unconditional WBINVD to support TDX
> > instead of adding one more vendor-specific check.  Kexec is a slow path,
> > and the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> 
> Out of curiosity, do you know why this was not already needed for non-self snoop
> CPUs? 
> 

Self snooping only deals with different memory types (UC, WB etc) but doesn't
handle memory encryption which with additional bit(s) tagged into the physical
address. 

> Why can't there be other cache modes that get written back after the new
> kernel starts using the memory for something else?

I assume teaching cache coherent protocol to understand the additional
encryption bit(s) isn't something that can be supported easily for all vendors.

> 
> > 
> > Only do WBINVD on bare-metal.  Doing WBINVD in guest triggers unexpected
>                                                 ^the

OK.

> > exception (#VE or #VC) for TDX and SEV-ES/SEV-SNP guests and the guest
> > may not be able to handle such exception (e.g., TDX guests panics if it
>                                                              ^panic

Ah good catch.

> > sees such #VE).
> 
> It's a small thing, but I think you could skip the #VE or #VC info in here. All
> they need to know to understand this patch is that TDX and some SEV kernels
> cannot handle WBINVD. And TDX panics. (does SEV not?)

I can remove "(#VE or #VC)", unless Kirill/Tom object.

SEV does not panic here in stop_this_cpu(), but it does panic later in
relocate_kernel():

https://lore.kernel.org/lkml/e1d37efb8951eb1d38493687b10a21b23353e35a.1710811610.git.kai.huang@intel.com/t/#m13e38c50f93afbf1c15f506e96817b2a78444b1d

> 
> > 
> > History of SME and kexec WBINVD:
> > 
> > There _was_ an issue preventing doing unconditional WBINVD but that has
> > been fixed.
> > 
> > Initial SME kexec support added an unconditional WBINVD in commit
> > 
> >   bba4ed011a52: ("x86/mm, kexec: Allow kexec to be used with SME")
> > 
> > This commit caused different Intel systems to hang or reset.
> > 
> > Without a clear root cause, a later commit
> > 
> >   f23d74f6c66c: ("x86/mm: Rework wbinvd, hlt operation in stop_this_cpu()")
> > 
> > fixed the Intel system hang issues by only doing WBINVD when hardware
> > supports SME.
> > 
> > A corner case [*] revealed the root cause of the system hang issues and
> > was fixed by commit
> > 
> >   1f5e7eb7868e: ("x86/smp: Make stop_other_cpus() more robust")
> > 
> > See [1][2] for more information.
> > 
> > Further testing of doing unconditional WBINVD based on the above fix on
> > the problematic machines (that issues were originally reported)
> > confirmed the issues couldn't be reproduced.
> > 
> > See [3][4] for more information.
> > 
> > Therefore, it is safe to do unconditional WBINVD for bare-metal now.
> 
> Instead of a play-by-play, it might be more informative to summarize the edges
> covered in this history:

I think the above is also informative.  Boris suggested to keep the discussion
around those relevant commits in the changelog:

https://lore.kernel.org/linux-kernel/20240228110207.GCZd8Sr8mXHA2KTiLz@fat_crate.local/
https://lore.kernel.org/linux-kernel/20240415175912.GAZh1q8GgpY3tpJj5a@fat_crate.local/

>  - Don't do anything that writes memory between wbinvd and native_halt(). This
> includes function calls that touch the stack.

This is a requirement to SME, but changing to unconditional WBINVD on bare-metal
doesn't change this behaviour.  What's the purpose of mentioning here?

>  - Avoid issuing wbinvd on multiple CPUs at the same time. As tglx implies it is
> too expensive.

Boris suggested to do unconditional WBINVD.  The test by Dave Young also
confirms there was no issue on the once-problematic platforms:

https://lore.kernel.org/lkml/CALu+AoSZkq1kz-xjvHkkuJ3C71d0SM5ibEJurdgmkZqZvNp2dQ@mail.gmail.com/

>  - Don't race reboot by watching cpumask instead of num_online_cpus(). But there
> is a race still.
> 
> Hmm, on the last one tglx says:
>     The cpumask cannot plug all holes either, but it's better than a raw
>     counter and allows to restrict the NMI fallback IPI to be sent only the
>     CPUs which have not reported within the timeout window
> 
> Are we opening up more platforms to a race by unconditionally doing the wbinvd?
> Can we be clarify that nothing bad happens if we lose the race? (and is it
> true?)

IIUC in most cases the REBOOT vector will just do the job, stop_other_cpus()
won't need to send NMIs thus I believe in most cases this shouldn't increase
race.  

I don't know what kinda platform will need NMI to stop remote CPUs.  For
instance, AFAICT my SPR machine with 192 CPUs never needed to send NMI in my
testings.

Dave Yong tested on those once-problematic platforms and doing unconditional
wbinvd was fine.  This patchset (albeit not upstreamed) has also been tested by
customers in their environment.  I believe doing unconditional WBINVD is fine.