Message-ID: <c0b8e728-f775-1023-055c-4f879bbd784a@amd.com>
Date: Fri, 27 Jun 2025 10:08:04 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Kai Huang <kai.huang@...el.com>, dave.hansen@...el.com, bp@...en8.de,
tglx@...utronix.de, peterz@...radead.org, mingo@...hat.com, hpa@...or.com
Cc: x86@...nel.org, kirill.shutemov@...ux.intel.com,
rick.p.edgecombe@...el.com, linux-kernel@...r.kernel.org,
pbonzini@...hat.com, seanjc@...gle.com, kvm@...r.kernel.org,
reinette.chatre@...el.com, isaku.yamahata@...el.com,
dan.j.williams@...el.com, ashish.kalra@....com, nik.borisov@...e.com,
sagis@...gle.com
Subject: Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd
during kexec
On 6/26/25 05:48, Kai Huang wrote:
> On SME platforms, at the hardware level, dirty cachelines with and
> without the encryption bit for the same memory can coexist, and the
> CPU can flush them back to memory in random order.  During kexec, the
> caches must be flushed before jumping to the new kernel to avoid
> silent memory corruption when a cacheline with a different encryption
> property is written back over whatever encryption properties the new
> kernel is using.
>
> TDX also needs a cache flush during kexec for the same reason.  It
> would be good to implement a generic way to flush the cache instead of
> scattering checks for each feature all around.
>
> During kexec, the kexec-ing CPU sends IPIs to all remote CPUs to stop
> them before it boots to the new kernel.  For SME, the kernel
> essentially encrypts all memory, including the kernel itself, by
> manipulating page tables, so a simple memory write from the kernel
> can dirty cachelines.
>
> Currently, the kernel uses WBINVD to flush the cache for SME during
> kexec in two places:
>
> 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
> stops them;
> 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the
> new kernel.
>
> Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
> SEAMCALLs are performed). Since there are no more SEAMCALLs after the
> aforementioned WBINVDs, leverage this for TDX.
>
> In order to have a generic way to cover both SME and TDX, use a percpu
> boolean to indicate that the cache may be in an incoherent state and
> thus a cache flush is needed during kexec, and turn on the boolean for
> SME.  TDX can then leverage it by also turning the boolean on.
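>
> A rough sketch of the idea (the variable and helper names below are
> illustrative, not necessarily the exact ones this patch adds):
>
> 	/*
> 	 * Percpu flag: this CPU's caches may hold dirty cachelines
> 	 * that are incoherent w.r.t. memory encryption properties.
> 	 */
> 	DECLARE_PER_CPU(bool, cache_state_incoherent);
>
> 	/* A feature such as SME or TDX turns it on for the CPUs it affects: */
> 	this_cpu_write(cache_state_incoherent, true);
>
> 	/* During kexec, each CPU flushes only if its own flag is set: */
> 	if (this_cpu_read(cache_state_incoherent))
> 		wbinvd();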
>
> A percpu boolean isn't strictly needed for SME since it is activated
> at very early boot time and on all CPUs; a global flag would be
> sufficient.  But using a percpu variable has two benefits.  Foremost,
> the state being tracked here (a per-CPU cache coherency situation
> requiring a flush) is itself percpu, so a percpu flag is a more direct
> and natural fit.
>
> Secondly, it fits TDX's usage better since the percpu variable can be
> set when a CPU makes a SEAMCALL, and cleared when another WBINVD on
> the CPU obviates the need for a kexec-time WBINVD.  Skipping the
> kexec-time WBINVD is valuable, because there is an existing race[*]
> where kexec could proceed while another CPU is active.  WBINVD could
> make this race worse, so it's worth skipping it when possible.
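>
> As an illustration of that intended TDX flow (a sketch using the same
> assumed flag name as above, not code from this patch):
>
> 	/*
> 	 * A SEAMCALL may leave dirty cachelines created by the TDX
> 	 * module on this CPU, so mark the cache as possibly incoherent
> 	 * before making it.
> 	 */
> 	this_cpu_write(cache_state_incoherent, true);
>
> 	/* ... SEAMCALL(s) are made here ... */
>
> 	/*
> 	 * If this CPU later flushes its caches anyway (e.g. a WBINVD
> 	 * when TDX is torn down), the kexec-time WBINVD becomes
> 	 * unnecessary and the flag can be cleared again.
> 	 */
> 	wbinvd();
> 	this_cpu_write(cache_state_incoherent, false);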
>
> Today, the first WBINVD in stop_this_cpu() is performed when SME is
> *supported* by the platform, while the second WBINVD in
> relocate_kernel() is done when SME is *activated* by the kernel.  Make
> things simple by also doing the second WBINVD when the platform
> supports SME.  This allows the kernel to simply turn on this percpu
> boolean when bringing up a CPU by checking whether the platform
> supports SME.
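>
> In other words, the CPU bringup side only needs something along the
> lines of the following (a sketch; the exact check and its placement in
> arch/x86/kernel/cpu/amd.c may differ from this patch):
>
> 	if (cpu_feature_enabled(X86_FEATURE_SME))
> 		this_cpu_write(cache_state_incoherent, true);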
>
> No other functional change intended.
>
> Also, machine_kexec() currently has a comment explaining why no
> function call is allowed after load_segments().  After changing to use
> the new percpu boolean to control whether to perform WBINVD when
> calling relocate_kernel(), that comment isn't strictly needed anymore.
> But it is still useful, so expand the comment around load_segments()
> to mention that, due to call depth tracking, no function call can be
> made after load_segments().
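>
> Conceptually (a sketch of one possible shape of the flow in
> machine_kexec(), not the literal diff; the local variable name is
> made up for illustration), the ordering becomes:
>
> 	/* Read the percpu flag while function calls are still allowed. */
> 	bool do_wbinvd = this_cpu_read(cache_state_incoherent);
>
> 	/*
> 	 * Due to call depth tracking, no function call can be made
> 	 * after load_segments().
> 	 */
> 	load_segments();
>
> 	/*
> 	 * ... jump to the identity-mapped relocate_kernel(), which
> 	 * performs WBINVD only when do_wbinvd is true ...
> 	 */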
>
> [*] The "race" in native_stop_other_cpus()
>
> Commit
>
>   1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")
>
> introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups on
> poweroff" issue which was caused by the WBINVD in stop_this_cpu().
> Specifically, the new cpumask resolved the below problem mentioned in
> that commit:
>
>     CPU0                                   CPU1
>
>  stop_other_cpus()
>    send_IPIs(REBOOT);                    stop_this_cpu()
>    while (num_online_cpus() > 1);          set_online(false);
>    proceed... -> hang
>                                            wbinvd()
>
> While it fixed the reported issue, that commit noted that the new
> cpumask "cannot plug all holes either".
>
> This is because it doesn't address the "race" mentioned in item #3 of
> the comment in native_stop_other_cpus():
>
> /*
> * 1) Send an IPI on the reboot vector to all other CPUs.
> *
> * The other CPUs should react on it after leaving critical
> * sections and re-enabling interrupts. They might still hold
> * locks, but there is nothing which can be done about that.
> *
> * 2) Wait for all other CPUs to report that they reached the
> * HLT loop in stop_this_cpu()
> *
> * 3) If #2 timed out send an NMI to the CPUs which did not
> * yet report
> *
> * 4) Wait for all other CPUs to report that they reached the
> * HLT loop in stop_this_cpu()
> *
> * #3 can obviously race against a CPU reaching the HLT loop late.
> * That CPU will have reported already and the "have all CPUs
> * reached HLT" condition will be true despite the fact that the
> * other CPU is still handling the NMI. Again, there is no
> * protection against that as "disabled" APICs still respond to
> * NMIs.
> */
>
> Consider the case below:
>
>         CPU 0                                  CPU 1
>
>  native_stop_other_cpus()                 stop_this_cpu()
>
>    // sends REBOOT IPI to stop remote CPUs
>                                              ...
>                                              wbinvd();
>
>    // wait times out, try NMI
>    if (!cpumask_empty(&cpus_stop_mask)) {
>      for_each_cpu(cpu, &cpus_stop_mask) {
>
>                                              ...
>                                              cpumask_clear_cpu(cpu,
>                                                  &cpus_stop_mask);
>                                              hlt;
>
>        // send NMI        --->
>                                              wakeup from hlt and run
>                                              stop_this_cpu():
>
>        // WAIT CPUs TO STOP
>        while (!cpumask_empty(
>                   &cpus_stop_mask) &&
>               ...)
>      }
>      ...
>    proceed ...                               wbinvd();
>                                              ...
>                                              hlt;
>
> The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs are
> stopped, but actually it quits immediately because the remote CPUs have
> been cleared in cpus_stop_mask when stop_this_cpu() is called from the
> REBOOT IPI.
>
> Doing WBINVD in stop_this_cpu() could potentially increase the chance
> of triggering the above "race", although it is still rare.
>
> Signed-off-by: Kai Huang <kai.huang@...el.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
Tested-by: Tom Lendacky <thomas.lendacky@....com>
> ---
> arch/x86/include/asm/kexec.h | 2 +-
> arch/x86/include/asm/processor.h | 2 ++
> arch/x86/kernel/cpu/amd.c | 16 ++++++++++++++++
> arch/x86/kernel/machine_kexec_64.c | 15 ++++++++++-----
> arch/x86/kernel/process.c | 16 +++-------------
> arch/x86/kernel/relocate_kernel_64.S | 15 +++++++++++----
> 6 files changed, 43 insertions(+), 23 deletions(-)
>