Message-ID: <c0b8e728-f775-1023-055c-4f879bbd784a@amd.com>
Date: Fri, 27 Jun 2025 10:08:04 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Kai Huang <kai.huang@...el.com>, dave.hansen@...el.com, bp@...en8.de,
tglx@...utronix.de, peterz@...radead.org, mingo@...hat.com, hpa@...or.com
Cc: x86@...nel.org, kirill.shutemov@...ux.intel.com,
rick.p.edgecombe@...el.com, linux-kernel@...r.kernel.org,
pbonzini@...hat.com, seanjc@...gle.com, kvm@...r.kernel.org,
reinette.chatre@...el.com, isaku.yamahata@...el.com,
dan.j.williams@...el.com, ashish.kalra@....com, nik.borisov@...e.com,
sagis@...gle.com
Subject: Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd
during kexec
On 6/26/25 05:48, Kai Huang wrote:
> On SME platforms, at the hardware level, dirty cachelines with and
> without the encryption bit for the same memory can coexist, and the
> CPU can flush them back to memory in random order.  During kexec, the
> caches must be flushed before jumping to the new kernel to avoid
> silent memory corruption when a cacheline with a different encryption
> property is written back over whatever encryption properties the new
> kernel is using.
>
> TDX also needs a cache flush during kexec for the same reason.  It
> would be good to implement a generic way to flush the cache instead of
> scattering checks for each feature all around.
>
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel.  For SME, the kernel
> essentially encrypts all memory, including the kernel itself, by
> manipulating page tables, so a simple memory write from the kernel
> can dirty cachelines.
>
> Currently, the kernel uses WBINVD to flush the cache for SME during
> kexec in two places:
>
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
> stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
> new kernel.
>
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed). Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
>
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate that the cache may be in an incoherent state and
> thus a cache flush is needed during kexec, and turn on the boolean for
> SME.  TDX can then leverage it by also turning the boolean on.
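>
> A rough sketch of the idea (the variable and helper names below are
> illustrative, not necessarily the exact ones this patch adds):
>
> 	/*
> 	 * Percpu flag: this CPU's caches may hold dirty cachelines
> 	 * that are incoherent w.r.t. memory encryption properties.
> 	 */
> 	DECLARE_PER_CPU(bool, cache_state_incoherent);
>
> 	/* A feature such as SME or TDX turns it on for the CPUs it affects: */
> 	this_cpu_write(cache_state_incoherent, true);
>
> 	/* During kexec, each CPU flushes only if its own flag is set: */
> 	if (this_cpu_read(cache_state_incoherent))
> 		wbinvd();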
>
> A percpu boolean isn't strictly needed for SME since it is activated
> at very early boot time and on all CPUs; a global flag would be
> sufficient.  But using a percpu variable has two benefits.  Foremost,
> the state being tracked here (a per-CPU cache coherency situation
> requiring a flush) is itself percpu, so a percpu flag is a more direct
> and natural fit.
>
> Secondly, it fits TDX's usage better since the percpu variable can be
> set when a CPU makes a SEAMCALL, and cleared when another WBINVD on
> the CPU obviates the need for a kexec-time WBINVD.  Skipping the
> kexec-time WBINVD is valuable, because there is an existing race[*]
> where kexec could proceed while another CPU is active.  WBINVD could
> make this race worse, so it's worth skipping it when possible.
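>
> As an illustration of that intended TDX flow (a sketch using the same
> assumed flag name as above, not code from this patch):
>
> 	/*
> 	 * A SEAMCALL may leave dirty cachelines created by the TDX
> 	 * module on this CPU, so mark the cache as possibly incoherent
> 	 * before making it.
> 	 */
> 	this_cpu_write(cache_state_incoherent, true);
>
> 	/* ... SEAMCALL(s) are made here ... */
>
> 	/*
> 	 * If this CPU later flushes its caches anyway (e.g. a WBINVD
> 	 * when TDX is torn down), the kexec-time WBINVD becomes
> 	 * unnecessary and the flag can be cleared again.
> 	 */
> 	wbinvd();
> 	this_cpu_write(cache_state_incoherent, false);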
>
> Today, the first WBINVD in stop_this_cpu() is performed when SME is
> *supported* by the platform, while the second WBINVD in
> relocate_kernel() is done when SME is *activated* by the kernel.  Make
> things simple by also doing the second WBINVD when the platform
> supports SME.  This allows the kernel to simply turn on this percpu
> boolean when bringing up a CPU by checking whether the platform
> supports SME.
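>
> In other words, the CPU bringup side only needs something along the
> lines of the following (a sketch; the exact check and its placement in
> arch/x86/kernel/cpu/amd.c may differ from this patch):
>
> 	if (cpu_feature_enabled(X86_FEATURE_SME))
> 		this_cpu_write(cache_state_incoherent, true);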
>
> No other functional change intended.
>
> Also, machine_kexec() currently has a comment explaining why no
> function call is allowed after load_segments().  After changing to use
> the new percpu boolean to control whether to perform WBINVD when
> calling relocate_kernel(), that comment isn't strictly needed anymore.
> But it is still useful, so expand the comment around load_segments()
> to mention that, due to call depth tracking, no function call can be
> made after load_segments().
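>
> Conceptually (a sketch of one possible shape of the flow in
> machine_kexec(), not the literal diff; the local variable name is
> made up for illustration), the ordering becomes:
>
> 	/* Read the percpu flag while function calls are still allowed. */
> 	bool do_wbinvd = this_cpu_read(cache_state_incoherent);
>
> 	/*
> 	 * Due to call depth tracking, no function call can be made
> 	 * after load_segments().
> 	 */
> 	load_segments();
>
> 	/*
> 	 * ... jump to the identity-mapped relocate_kernel(), which
> 	 * performs WBINVD only when do_wbinvd is true ...
> 	 */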
>
> [*] The "race" in native_stop_other_cpus()
>
> Commit
>
>   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
>
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
>
>     CPU0                                   CPU1
>
>  stop_other_cpus()
>    send_IPIs(REBOOT);                    stop_this_cpu()
>    while (num_online_cpus() > 1);          set_online(false);
>    proceed... -> hang
>                                            wbinvd()
>
> While it fixed the reported issue, that commit noted that the new
> cpumask "cannot plug all holes either".
>
> This is because it doesn't address the "race" mentioned in item #3 of
> the comment in native_stop_other_cpus():
>
> /*
> * 1) Send an IPI on the reboot vector to all other CPUs.
> *
> * The other CPUs should react on it after leaving critical
> * sections and re-enabling interrupts. They might still hold
> * locks, but there is nothing which can be done about that.
> *
> * 2) Wait for all other CPUs to report that they reached the
> * HLT loop in stop_this_cpu()
> *
> * 3) If #2 timed out send an NMI to the CPUs which did not
> * yet report
> *
> * 4) Wait for all other CPUs to report that they reached the
> * HLT loop in stop_this_cpu()
> *
> * #3 can obviously race against a CPU reaching the HLT loop late.
> * That CPU will have reported already and the "have all CPUs
> * reached HLT" condition will be true despite the fact that the
> * other CPU is still handling the NMI. Again, there is no
> * protection against that as "disabled" APICs still respond to
> * NMIs.
> */
>
> Consider the case below:
>
>         CPU 0                                  CPU 1
>
>  native_stop_other_cpus()                 stop_this_cpu()
>
>    // sends REBOOT IPI to stop remote CPUs
>                                              ...
>                                              wbinvd();
>
>    // wait times out, try NMI
>    if (!cpumask_empty(&cpus_stop_mask)) {
>      for_each_cpu(cpu, &cpus_stop_mask) {
>
>                                              ...
>                                              cpumask_clear_cpu(cpu,
>                                                  &cpus_stop_mask);
>                                              hlt;
>
>        // send NMI        --->
>                                              wakeup from hlt and run
>                                              stop_this_cpu():
>
>        // WAIT CPUs TO STOP
>        while (!cpumask_empty(
>                   &cpus_stop_mask) &&
>               ...)
>      }
>      ...
>    proceed ...                               wbinvd();
>                                              ...
>                                              hlt;
>
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
>
> Doing WBINVD in stop_this_cpu() could potentially increase the chance
> of triggering the above "race", although it is still rare.
>
> Signed-off-by: Kai Huang <kai.huang@...el.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
Tested-by: Tom Lendacky <thomas.lendacky@....com>
> ---
> arch/x86/include/asm/kexec.h | 2 +-
> arch/x86/include/asm/processor.h | 2 ++
> arch/x86/kernel/cpu/amd.c | 16 ++++++++++++++++
> arch/x86/kernel/machine_kexec_64.c | 15 ++++++++++-----
> arch/x86/kernel/process.c | 16 +++-------------
> arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
> 6 files changed, 43 insertions(+), 23 deletions(-)
>