Message-ID: <CAD=FV=V_TGvRgZy9uFzF_tGX25oYzVrjHRrg-CphwmhmJRwsKg@mail.gmail.com>
Date: Fri, 17 May 2024 13:01:58 -0700
From: Doug Anderson <dianders@...omium.org>
To: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>
Cc: Mark Rutland <mark.rutland@....com>, Marc Zyngier <maz@...nel.org>, 
	Misono Tomohiro <misono.tomohiro@...itsu.com>, Chen-Yu Tsai <wens@...e.org>, 
	Stephen Boyd <swboyd@...omium.org>, Daniel Thompson <daniel.thompson@...aro.org>, 
	Sumit Garg <sumit.garg@...aro.org>, Frederic Weisbecker <frederic@...nel.org>, 
	"Guilherme G. Piccoli" <gpiccoli@...lia.com>, Josh Poimboeuf <jpoimboe@...nel.org>, 
	Kees Cook <keescook@...omium.org>, Peter Zijlstra <peterz@...radead.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Tony Luck <tony.luck@...el.com>, 
	Valentin Schneider <vschneid@...hat.com>, linux-arm-kernel@...ts.infradead.org, 
	linux-hardening@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] arm64: smp: smp_send_stop() and crash_smp_send_stop()
 should try non-NMI first

Hi,

On Thu, Dec 7, 2023 at 5:03 PM Douglas Anderson <dianders@...omium.org> wrote:
>
> When testing hard lockup handling on my sc7180-trogdor-lazor device
> with pseudo-NMI enabled, with serial console enabled and with kgdb
> disabled, I found that the stack crawls printed to the serial console
> ended up as a jumbled mess. After rebooting, the pstore-based console
> looked fine though. Also, enabling kgdb to trap the panic made the
> console look fine and avoided the mess.
>
> After a bit of tracking down, I came to the conclusion that this was
> what was happening:
> 1. The panic path was stopping all other CPUs with
>    panic_other_cpus_shutdown().
> 2. At least one of those other CPUs was in the middle of printing to
>    the serial console and holding the console port's lock, which is
>    grabbed with "irqsave". ...but since we were stopping with an NMI
>    we didn't care about the "irqsave" and interrupted anyway.
> 3. Since we stopped the CPU while it was holding the lock it would
>    never release it.
> 4. All future calls to output to the console would end up failing to
>    get the lock in qcom_geni_serial_console_write(). This isn't
>    _totally_ unexpected at panic time but it's a code path that's not
>    well tested, hard to get right, and apparently doesn't work
>    terribly well on the Qualcomm geni serial driver.
>
> It would probably be a reasonable idea to try to make the Qualcomm
> geni serial driver work better, but also it's nice not to get into
> this situation in the first place.
>
> Taking a page from what x86 appears to do in native_stop_other_cpus(),
> let's do this:
> 1. First, we'll try to stop other CPUs with a normal IPI and wait a
>    second. This gives them a chance to leave critical sections.
> 2. If CPUs fail to stop then we'll retry with an NMI, but give a much
>    lower timeout since there's no good reason for a CPU not to react
>    quickly to an NMI.
>
> This works well and avoids the corrupted console and (presumably)
> could help avoid other similar issues.
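
As a rough illustration of the two-step sequence described above, here
is a small userspace C model (not kernel code). The sim_* names, the
32-CPU limit and the irqs_masked field are made up for this sketch, and
the waits (one second, then a shorter one for the NMI) are left out. It
only shows that the NMI pass is aimed at just the CPUs that ignored the
normal IPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per simulated CPU (at most 32). */
struct sim_cpus {
	uint32_t running;     /* CPUs that have not stopped yet */
	uint32_t irqs_masked; /* CPUs that will ignore a normal IPI */
};

/*
 * Deliver a stop request to 'targets'. A normal IPI only reaches CPUs
 * that do not have IRQs masked; an NMI reaches all of them.
 * Returns the subset of 'targets' that is still running.
 */
static uint32_t sim_send_stop(struct sim_cpus *s, uint32_t targets, bool nmi)
{
	uint32_t responders = nmi ? targets : (targets & ~s->irqs_masked);

	s->running &= ~responders;
	return s->running & targets;
}

/*
 * Two-step stop: normal IPI first, then NMI for whoever is left.
 * Returns the mask of other CPUs still running after both passes.
 */
static uint32_t sim_stop_others(struct sim_cpus *s, unsigned int self)
{
	uint32_t targets, left;

	assert(self < 32);
	targets = s->running & ~(UINT32_C(1) << self);

	left = sim_send_stop(s, targets, false);   /* step 1: normal IPI */
	if (left)
		left = sim_send_stop(s, left, true); /* step 2: NMI fallback */
	return left;
}
```

With four CPUs running and CPU 2 holding IRQs off, calling
sim_stop_others() from CPU 0 stops CPUs 1 and 3 in the first pass and
sends the NMI only to CPU 2.
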
>
> In order to do this, we need to do a little re-organization of our
> IPIs since we don't have any more free IDs. We'll do what was
> suggested in previous conversations and combine "stop" and "crash
> stop". That frees up an IPI so now we can have a "stop" and "stop
> NMI".
>
> In order to do this we also need a slight change in the way we keep
> track of which CPUs still need to be stopped. We need to know
> specifically which CPUs haven't stopped yet when we fall back to NMI
> but in the "crash stop" case the "cpu_online_mask" isn't updated as
> CPUs go down. This is why that code path had an atomic of the number
> of CPUs left. We'll solve this by making the cpumask into a
> global. This has a potential memory implication--with NR_CPUS = 4096
> this is 4096/8 = 512 bytes of globals. On the upside in that same case
> we take 512 bytes off the stack which could potentially have made the
> stop code less reliable. It can be noted that the NMI backtrace code
> (lib/nmi_backtrace.c) uses the same approach and that use also
> confirms that updating the mask is safe from NMI.
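
To make the size arithmetic above concrete, here is a userspace sketch
of a 4096-bit mask stored as an array of unsigned long. The SIM_* and
sim_* names are hypothetical stand-ins for this example, not the
kernel's DECLARE_BITMAP() or cpumask helpers, and the helpers here are
deliberately non-atomic. Unlike a plain counter, the mask records which
CPUs are still pending, not just how many.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NR_CPUS       4096
#define SIM_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SIM_BITS_TO_LONGS(n) \
	(((n) + SIM_BITS_PER_LONG - 1) / SIM_BITS_PER_LONG)

/* One bit per possible CPU, held in a global rather than on the stack. */
static unsigned long sim_stop_mask[SIM_BITS_TO_LONGS(SIM_NR_CPUS)];

/* Mark 'cpu' as still needing to stop. */
static void sim_mask_set(unsigned int cpu)
{
	assert(cpu < SIM_NR_CPUS);
	sim_stop_mask[cpu / SIM_BITS_PER_LONG] |=
		1UL << (cpu % SIM_BITS_PER_LONG);
}

/* Mark 'cpu' as stopped. */
static void sim_mask_clear(unsigned int cpu)
{
	assert(cpu < SIM_NR_CPUS);
	sim_stop_mask[cpu / SIM_BITS_PER_LONG] &=
		~(1UL << (cpu % SIM_BITS_PER_LONG));
}

/* Is 'cpu' still pending? */
static bool sim_mask_test(unsigned int cpu)
{
	assert(cpu < SIM_NR_CPUS);
	return sim_stop_mask[cpu / SIM_BITS_PER_LONG] &
	       (1UL << (cpu % SIM_BITS_PER_LONG));
}

/* Storage used by the mask: 4096 bits / 8 bits per byte. */
static size_t sim_stop_mask_bytes(void)
{
	return sizeof(sim_stop_mask);
}
```
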
>
> All of the above lets us combine the logic for "stop" and "crash stop"
> code, which appeared to have a bunch of arbitrary implementation
> differences. Possibly this could make up for some of the 512 wasted
> bytes. ;-)
>
> Aside from the above change where we try a normal IPI and then an NMI,
> the combined function has a few subtle differences:
> * In the normal smp_send_stop(), if we fail to stop one or more CPUs
>   then we won't include the current CPU (the one running
>   smp_send_stop()) in the error message.
> * In crash_smp_send_stop(), if we fail to stop some CPUs we'll print
>   the CPUs that we failed to stop instead of printing all _but_ the
>   current running CPU.
> * In crash_smp_send_stop(), we will now only print "SMP: stopping
>   secondary CPUs" if (system_state <= SYSTEM_RUNNING).
>
> Fixes: d7402513c935 ("arm64: smp: IPI_CPU_STOP and IPI_CPU_CRASH_STOP should try for NMI")
> Signed-off-by: Douglas Anderson <dianders@...omium.org>
> ---
> I'm not set up to test crash_smp_send_stop(). I made sure it
> compiled and hacked the panic() method to call it, but I haven't
> actually run kexec. Hopefully others can confirm that it's working for
> them.
>
>  arch/arm64/kernel/smp.c | 115 +++++++++++++++++++---------------------
>  1 file changed, 54 insertions(+), 61 deletions(-)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index defbab84e9e5..9fe9d4342517 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -71,7 +71,7 @@ enum ipi_msg_type {
>         IPI_RESCHEDULE,
>         IPI_CALL_FUNC,
>         IPI_CPU_STOP,
> -       IPI_CPU_CRASH_STOP,
> +       IPI_CPU_STOP_NMI,
>         IPI_TIMER,
>         IPI_IRQ_WORK,
>         NR_IPI,
> @@ -88,6 +88,9 @@ static int ipi_irq_base __ro_after_init;
>  static int nr_ipi __ro_after_init = NR_IPI;
>  static struct irq_desc *ipi_desc[MAX_IPI] __ro_after_init;
>
> +static DECLARE_BITMAP(stop_mask, NR_CPUS) __read_mostly;
> +static bool crash_stop;
> +
>  static void ipi_setup(int cpu);
>
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -770,7 +773,7 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
>         [IPI_RESCHEDULE]        = "Rescheduling interrupts",
>         [IPI_CALL_FUNC]         = "Function call interrupts",
>         [IPI_CPU_STOP]          = "CPU stop interrupts",
> -       [IPI_CPU_CRASH_STOP]    = "CPU stop (for crash dump) interrupts",
> +       [IPI_CPU_STOP_NMI]      = "CPU stop NMIs",
>         [IPI_TIMER]             = "Timer broadcast interrupts",
>         [IPI_IRQ_WORK]          = "IRQ work interrupts",
>  };
> @@ -831,17 +834,11 @@ void __noreturn panic_smp_self_stop(void)
>         local_cpu_stop();
>  }
>
> -#ifdef CONFIG_KEXEC_CORE
> -static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0);
> -#endif
> -
>  static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
>  {
>  #ifdef CONFIG_KEXEC_CORE
>         crash_save_cpu(regs, cpu);
>
> -       atomic_dec(&waiting_for_crash_ipi);

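Upon reading the patch with fresh eyes, I think I actually need to
move the "cpumask_clear_cpu(cpu, to_cpumask(stop_mask))" here, in
place of the removed atomic_dec(). Specifically, I think it's
important that it happens _after_ the call to crash_save_cpu(), since
otherwise the CPU waiting on stop_mask could treat this CPU as stopped
before its state has been saved.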
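
A minimal userspace sketch of that ordering, using C11 atomics. All of
the sim_* names are hypothetical, and the release/acquire pairing is an
assumption of this model rather than a statement about what
cpumask_clear_cpu() guarantees in the kernel. The point is only that
the bit is cleared after the save, so a waiter that sees an empty mask
also sees the saved state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SIM_MAX_CPUS 32

/* Bit set = that simulated CPU has not finished stopping yet. */
static atomic_uint sim_crash_mask;
static bool sim_regs_saved[SIM_MAX_CPUS];

/* Stand-in for crash_save_cpu(): just records that the save happened. */
static void sim_crash_save_cpu(unsigned int cpu)
{
	sim_regs_saved[cpu] = true;
}

/* Stop handler model: save state first, then report "stopped". */
static void sim_ipi_cpu_crash_stop(unsigned int cpu)
{
	assert(cpu < SIM_MAX_CPUS);

	sim_crash_save_cpu(cpu);

	/* Release: the save above is visible before the bit reads clear. */
	atomic_fetch_and_explicit(&sim_crash_mask, ~(1u << cpu),
				  memory_order_release);
}

/* Waiter side: true once every targeted CPU has cleared its bit. */
static bool sim_all_stopped(void)
{
	return atomic_load_explicit(&sim_crash_mask,
				    memory_order_acquire) == 0;
}
```
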


>         local_irq_disable();

The above local_irq_disable() is not new in my patch, but it seems
wonky for two reasons:

1. It feels like it should have been the first thing in the function.

2. It feels like it should be local_daif_mask() instead.

I _think_ it doesn't actually matter because, with the current code,
we're only ever called from do_handle_IPI() and thus local IRQs will
be off (and local NMIs will be off if we're called from NMI context).
However, once we have the IRQ + NMI fallback it _might_ matter if we
were midway through finally handling the IRQ-based IPI when we decided
to try the NMI-based one.

For the next spin of the patch I'll plan to get rid of the
local_irq_disable() and instead have local_daif_mask() be the first
thing that this function does.