
Message-ID: <ZIvByEFqiJZOyau2@a4bf019067fa.jf.intel.com>
Date:   Thu, 15 Jun 2023 18:58:32 -0700
From:   Ashok Raj <ashok.raj@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     LKML <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
        Mario Limonciello <mario.limonciello@....com>,
        Tom Lendacky <thomas.lendacky@....com>,
        "Tony Battersby" <tonyb@...ernetics.com>,
        Ashok Raj <ashok.raj@...ux.intel.com>,
        Tony Luck <tony.luck@...el.com>,
        Arjan van de Veen <arjan@...ux.intel.com>,
        Eric Biederman <ebiederm@...ssion.com>,
        Ashok Raj <ashok.raj@...el.com>
Subject: Re: [patch v3 1/7] x86/smp: Make stop_other_cpus() more robust

Hi Thomas,

On Thu, Jun 15, 2023 at 10:33:50PM +0200, Thomas Gleixner wrote:
> Tony reported intermittent lockups on poweroff. His analysis identified the
> wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
> on SME enabled machines a kexec() does not leave any stale data in the
> caches when switching from encrypted to non-encrypted mode or vice versa.
> 
> That wbinvd() is conditional on the SME feature bit which is read directly
> from CPUID. But that readout does not check whether the CPUID leaf is
> available or not. If it's not available the CPU will return the value of
> the highest supported leaf instead. Depending on the content the "SME" bit
> might be set or not.
> 
> That's incorrect but harmless. Making the CPUID readout conditional makes
> the observed hangs go away, but it does not fix the underlying problem:
> 
> CPU0					CPU1
> 
>  stop_other_cpus()
>    send_IPIs(REBOOT);			stop_this_cpu()
>    while (num_online_cpus() > 1);         set_online(false);
>    proceed... -> hang
> 				          wbinvd()
> 
> WBINVD is an expensive operation and if multiple CPUs issue it at the same
> time the resulting delays are even larger.
> 
> But CPU0 already observed num_online_cpus() going down to 1 and proceeds
> which causes the system to hang.
> 
> This issue exists independent of WBINVD, but the delays caused by WBINVD
> make it more prominent.
> 
> Make this more robust by adding a cpumask which is initialized to the
> online CPU mask before sending the IPIs and CPUs clear their bit in
> stop_this_cpu() after the WBINVD completed. Check for that cpumask to
> become empty in stop_other_cpus() instead of watching num_online_cpus().
> 
> The cpumask cannot plug all holes either, but it's better than a raw
> counter and allows to restrict the NMI fallback IPI to be sent only to
> the CPUs which have not reported within the timeout window.
> 
> Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
> Reported-by: Tony Battersby <tonyb@...ernetics.com>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
> ---
> V3: Use a cpumask to make the NMI case slightly safer - Ashok
> ---
>  arch/x86/include/asm/cpu.h |    2 +
>  arch/x86/kernel/process.c  |   23 +++++++++++++-
>  arch/x86/kernel/smp.c      |   71 +++++++++++++++++++++++++++++++--------------
>  3 files changed, 73 insertions(+), 23 deletions(-)
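
To make the race in the changelog concrete, here is a small userspace model of it. This is only a sketch under my own assumptions, not the patch code: the names (sim_state, sim_send_stop(), sim_wait_mask() and so on) are made up for illustration, time is modeled as discrete ticks, and the IPI is assumed to be handled instantly. It shows that the online count drops to 1 while the targets are still flushing, whereas the mask only empties once each CPU is past its modeled WBINVD, and that after a timeout the remaining bits identify the CPUs that would need the NMI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_MAX_CPUS 64

/* Hypothetical model of one CPU; none of this is kernel API. */
struct sim_cpu {
	bool online;      /* cleared as soon as the stop IPI is handled */
	int  flush_ticks; /* ticks the modeled WBINVD still needs */
};

struct sim_state {
	struct sim_cpu cpu[SIM_MAX_CPUS];
	int nr_cpus;
	uint64_t stop_mask; /* one bit per CPU that has not finished stopping */
};

static void sim_init(struct sim_state *s, int nr_cpus, const int *flush_ticks)
{
	assert(nr_cpus >= 1 && nr_cpus <= SIM_MAX_CPUS);
	s->nr_cpus = nr_cpus;
	s->stop_mask = 0;
	for (int i = 0; i < nr_cpus; i++) {
		s->cpu[i].online = true;
		s->cpu[i].flush_ticks = flush_ticks[i];
	}
}

/* CPU0: snapshot the other CPUs into the mask, then "send" the stop IPI. */
static void sim_send_stop(struct sim_state *s)
{
	for (int i = 1; i < s->nr_cpus; i++) {
		s->stop_mask |= UINT64_C(1) << i;
		/* Assumption: the target goes offline immediately, before flushing. */
		s->cpu[i].online = false;
	}
}

/* Advance each stopping CPU by one tick; its bit clears only after the flush. */
static void sim_tick(struct sim_state *s)
{
	for (int i = 1; i < s->nr_cpus; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (!(s->stop_mask & bit))
			continue;
		if (s->cpu[i].flush_ticks > 0)
			s->cpu[i].flush_ticks--;
		else
			s->stop_mask &= ~bit;
	}
}

/* The old completion check: just count CPUs still marked online. */
static int sim_num_online(const struct sim_state *s)
{
	int n = 0;

	for (int i = 0; i < s->nr_cpus; i++)
		n += s->cpu[i].online;
	return n;
}

/* Wait up to 'timeout' ticks; return the CPUs that still have not reported. */
static uint64_t sim_wait_mask(struct sim_state *s, int timeout)
{
	while (s->stop_mask && timeout-- > 0)
		sim_tick(s);
	return s->stop_mask;
}
```

With four CPUs and a slow flush on the last one, sim_num_online() already returns 1 right after sim_send_stop(), while sim_wait_mask() with a short timeout returns a mask with only the slow CPU's bit set. In this model that CPU is the only one that would be targeted by the fallback.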

I tested them and they seem to work fine on my system.

It would be great if Tony could check on his setup as well.

One thought on sending NMI below.

[snip]

>  
>  	/* if the REBOOT_VECTOR didn't work, try with the NMI */
> -	if (num_online_cpus() > 1) {
> +	if (!cpumask_empty(&cpus_stop_mask)) {
>  		/*
>  		 * If NMI IPI is enabled, try to register the stop handler
>  		 * and send the IPI. In any case try to wait for the other
>  		 * CPUs to stop.
>  		 */
>  		if (!smp_no_nmi_ipi && !register_stop_handler()) {
> +			u32 dm;
> +
>  			/* Sync above data before sending IRQ */
>  			wmb();
>  
>  			pr_emerg("Shutting down cpus with NMI\n");
>  
> -			apic_send_IPI_allbutself(NMI_VECTOR);
> +			dm = apic->dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
> +			dm |= APIC_DM_NMI;
> +
> +			for_each_cpu(cpu, &cpus_stop_mask) {
> +				u32 apicid = apic->cpu_present_to_apicid(cpu);
> +
> +				apic_icr_write(dm, apicid);
> +				apic_wait_icr_idle();

Can we simplify this by just using apic->send_IPI(cpu, NMI_VECTOR)?