linux-kernel - Re: [PATCH] panic: disable optimistic spin after halting CPUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YeBjzX7Z6J1cVtjC@alley>
Date:   Thu, 13 Jan 2022 18:39:25 +0100
From:   Petr Mladek <pmladek@...e.com>
To:     Stephen Brennan <stephen.s.brennan@...cle.com>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Sergey Senozhatsky <senozhatsky@...omium.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        John Ogness <john.ogness@...utronix.de>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] panic: disable optimistic spin after halting CPUs

On Wed 2022-01-12 16:54:25, Stephen Brennan wrote:
> A CPU executing with console lock spinning enabled might be halted
> during a panic. Before console_flush_on_panic(), it's possible for
> console_trylock() to attempt optimistic spinning, deadlocking the panic
> CPU:
> 
> CPU 0 (panic CPU)             CPU 1
> -----------------             ------
>                               printk() {
>                                 vprintk_func() {
>                                   vprintk_default() {
>                                     vprintk_emit() {
>                                       console_unlock() {
>                                         console_lock_spinning_enable();
>                                         ... printing to console ...
> panic() {
>   crash_smp_send_stop() {
>     NMI  -------------------> HALT
>   }
>   atomic_notifier_call_chain() {
>     printk() {
>       ...
>       console_trylock_spinnning() {
>         // optimistic spin infinitely

I see.

> This hang during panic can be induced when a kdump kernel is loaded, and
> crash_kexec_post_notifiers=1 is present on the kernel command line. The
> following script which concurrently writes to /dev/kmsg, and triggers a
> panic, can result in this hang:
> 
>     #!/bin/bash
>     date
>     # 991 chars (based on log buffer size):
>     chars="$(printf 'a%.0s' {1..991})"
>     while :; do
>         echo $chars > /dev/kmsg
>     done &
>     echo c > /proc/sysrq-trigger &
>     date
>     exit
> 
> Below are the hang rates for the above test case, based on v5.16-rc8
> before and after this patch:
> 
> before:  197 hangs / 340 trials - 57.9%
> after :    0 hangs / 245 trials -  0.0%
> 
> Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")
> 
> Signed-off-by: Stephen Brennan <stephen.s.brennan@...cle.com>
> Reviewed-by: Junxiao Bi <junxiao.bi@...cle.com>
> ---
>  include/linux/console.h |  1 +
>  kernel/panic.c          |  7 +++++++
>  kernel/printk/printk.c  | 17 +++++++++++++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/include/linux/console.h b/include/linux/console.h
> index a97f277cfdfa..4eeb46927d96 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -179,6 +179,7 @@ extern void console_unlock(void);
>  extern void console_conditional_schedule(void);
>  extern void console_unblank(void);
>  extern void console_flush_on_panic(enum con_flush_mode mode);
> +extern void console_lock_spinning_disable_on_panic(void);
>  extern struct tty_driver *console_device(int *);
>  extern void console_stop(struct console *);
>  extern void console_start(struct console *);
> diff --git a/kernel/panic.c b/kernel/panic.c
> index cefd7d82366f..a9340e580b20 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -265,6 +265,13 @@ void panic(const char *fmt, ...)
>  		crash_smp_send_stop();
>  	}
>  
> +	/*
> +	 * Now that we've halted other CPUs, disable optimistic spinning in
> +	 * printk(). This avoids deadlocking in console_trylock(), until we call
> +	 * console_flush_on_panic().
> +	 */
> +	console_lock_spinning_disable_on_panic();

The proposed implementation causes that the panicing CPU takes over
console_lock even when the current owner is in the middle of
call_console_drivers().

It means that calling console drivers from another CPU is not
completely safe. This is why console_flush_on_panic() is called
at end when the console is the only way to see the messages.

We should not do this before kmsg_dump() and __crash_kexec()
that allow to see the messages a "more safe" way.

I suggest to disable the spinning when panic is in progress. I mean
something like:

--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1842,6 +1842,10 @@ static int console_trylock_spinning(void)
 	if (console_trylock())
 		return 1;
 
+	/* Spinning is not safe when the system is stopped */
+	if (read_atomic(&panic_cpu) != PANIC_CPU_INVALID)
+		return 0;
+
 	printk_safe_enter_irqsave(flags);
 
 	raw_spin_lock(&console_owner_lock);


It would be also great when the current owner releases console_lock
so that the panicing CPU could take over it.

I think about the following. Well, I am not sure if it would help
in the real life because other CPUs are stopped quite early in panic().

--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2646,13 +2650,18 @@ void console_unlock(void)
 
 	for (;;) {
 		size_t ext_len = 0;
-		int handover;
+		int handover, pcpu;
 		size_t len;
 
 skip:
 		if (!prb_read_valid(prb, console_seq, &r))
 			break;
 
+		/* Allow the panic_cpu to take over consoles a safe way. */
+		pcpu = read_atomic(&panic_cpu);
+		if (pcpu != PANIC_CPU_INVALID && pcpu != smp_processor_id())
+			break;
+
 		if (console_seq != r.info->seq) {
 			console_dropped += r.info->seq - console_seq;
 			console_seq = r.info->seq;



Note that the above code is not even compile tested. panic_cpu is
local variable in panic.c. If we go this way than I would define
some helpers:

bool panic_in_progress(void)
{
	return read_atomic(&panic_cpu) != PANIC_CPU_INVALID;
}

bool panic_in_progress_on_other_cpu(void)
{
	int pcpu = read_atomic(&panic_cpu);

	if (pcpu == PANIC_CPU_INVALID)
		return false;

	if (pcpu == smp_processor_id())
		return false;

	return true;
}

Best Regards,
Petr