linux-kernel - Re: [patch V2 1/8] x86/smp: Make stop_other

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZIoYMakfbAU9EOjc@a4bf019067fa.jf.intel.com>
Date:   Wed, 14 Jun 2023 12:42:41 -0700
From:   Ashok Raj <ashok.raj@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     LKML <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
        Mario Limonciello <mario.limonciello@....com>,
        Tom Lendacky <thomas.lendacky@....com>,
        "Tony Battersby" <tonyb@...ernetics.com>,
        Ashok Raj <ashok.raj@...ux.intel.com>,
        Tony Luck <tony.luck@...el.com>,
        Arjan van de Veen <arjan@...ux.intel.com>,
        Eric Biederman <ebiederm@...ssion.com>,
        Ashok Raj <ashok.raj@...el.com>
Subject: Re: [patch V2 1/8] x86/smp: Make stop_other_cpus() more robust

Hi Thomas,

On Tue, Jun 13, 2023 at 02:17:55PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
> 
> That wbindv() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
> 
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
> 
> CPU0					CPU1
> 
>  stop_other_cpus()
>    send_IPIs(REBOOT);			stop_this_cpu()
>    while (num_online_cpus() > 1);         set_online(false);
>    proceed... -> hang
> 				          wbinvd()
> 
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.

Is this situation similar to what happened with the unexpected wakeups from
mwait_play_dead()? 

i.e the wbinvd() takes a while, and when CPU0 moves ahead, the previous
kernel marches past the wbinvd() instruction since we didn't wait to ensure
this has indeed completed?

native_machine_halt()
{
	machine_shutdown()->stop_other_cpus() 
	stop_this_cpu();<---- Unbalanced atomic_dec()?
}

But the last stop_this_cpu() in native_machine_halt() would 
make the count go negative? Maybe that's OK since no one is waiting for it
to go zero at that point?

> 
> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
> 
> Make this more robust by adding a counter which is set to the number of
> online CPUs before sending the IPIs and decremented in stop_this_cpu()
> after the WBINVD completed. Check for that counter in stop_other_cpus()
> instead of watching num_online_cpus().
> 
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@...ernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
>  arch/x86/include/asm/cpu.h |    2 ++
>  arch/x86/kernel/process.c  |   10 ++++++++++
>  arch/x86/kernel/smp.c      |   15 ++++++++++++---
>  3 files changed, 24 insertions(+), 3 deletions(-)
> 
> 
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -98,4 +98,6 @@ extern u64 x86_read_arch_cap_msr(void);
>  int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
>  int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
>  
> +extern atomic_t stop_cpus_count;
> +
>  #endif /* _ASM_X86_CPU_H */
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -759,6 +759,8 @@ bool xen_set_default_idle(void)
>  }
>  #endif
>  
> +atomic_t stop_cpus_count;
> +
>  void __noreturn stop_this_cpu(void *dummy)
>  {
>  	local_irq_disable();
> @@ -783,6 +785,14 @@ void __noreturn stop_this_cpu(void *dumm
>  	 */
>  	if (cpuid_eax(0x8000001f) & BIT(0))
>  		native_wbinvd();
> +
> +	/*
> +	 * native_stop_other_cpus() will write to @stop_cpus_count after
> +	 * observing that it went down to zero, which will invalidate the
> +	 * cacheline on this CPU.
> +	 */
> +	atomic_dec(&stop_cpus_count);
> +
>  	for (;;) {
>  		/*
>  		 * Use native_halt() so that memory contents don't change
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -27,6 +27,7 @@
>  #include <asm/mmu_context.h>
>  #include <asm/proto.h>
>  #include <asm/apic.h>
> +#include <asm/cpu.h>
>  #include <asm/idtentry.h>
>  #include <asm/nmi.h>
>  #include <asm/mce.h>
> @@ -171,6 +172,8 @@ static void native_stop_other_cpus(int w
>  		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
>  			return;
>  
> +		atomic_set(&stop_cpus_count, num_online_cpus() - 1);
> +
>  		/* sync above data before sending IRQ */
>  		wmb();
>  
> @@ -183,12 +186,12 @@ static void native_stop_other_cpus(int w
>  		 * CPUs reach shutdown state.
>  		 */
>  		timeout = USEC_PER_SEC;
> -		while (num_online_cpus() > 1 && timeout--)
> +		while (atomic_read(&stop_cpus_count) > 0 && timeout--)
>  			udelay(1);
>  	}
>  
>  	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> -	if (num_online_cpus() > 1) {
> +	if (atomic_read(&stop_cpus_count) > 0) {
>  		/*
>  		 * If NMI IPI is enabled, try to register the stop handler
>  		 * and send the IPI. In any case try to wait for the other
> @@ -208,7 +211,7 @@ static void native_stop_other_cpus(int w
>  		 * one or more CPUs do not reach shutdown state.
>  		 */
>  		timeout = USEC_PER_MSEC * 10;
> -		while (num_online_cpus() > 1 && (wait || timeout--))
> +		while (atomic_read(&stop_cpus_count) > 0 && (wait || timeout--))
>  			udelay(1);
>  	}

If we go down the INIT path, life is less complicated.. 

After REBOOT_VECTOR IPI, if stop_cpus_count > 0, we send NMI to all CPUs.
Won't this completely update the atomic_dec() since CPUs in hlt() will also
take the NMI correct? I'm not sure if this is problematic.

Or should we reinitialize stop_cpus_count before the NMI hurrah?

>  
> @@ -216,6 +219,12 @@ static void native_stop_other_cpus(int w
>  	disable_local_APIC();
>  	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
>  	local_irq_restore(flags);
> +
> +	/*
> +	 * Ensure that the cache line is invalidated on the other CPUs. See
> +	 * comment vs. SME in stop_this_cpu().
> +	 */
> +	atomic_set(&stop_cpus_count, INT_MAX);

Didn't understand why INT_MAX here?