lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Thu, 13 Jan 2022 11:36:35 -0800 From: Stephen Brennan <stephen.s.brennan@...cle.com> To: Petr Mladek <pmladek@...e.com> Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Sergey Senozhatsky <senozhatsky@...omium.org>, Steven Rostedt <rostedt@...dmis.org>, John Ogness <john.ogness@...utronix.de>, Sergey Senozhatsky <sergey.senozhatsky@...il.com>, linux-kernel@...r.kernel.org Subject: Re: [PATCH] panic: disable optimistic spin after halting CPUs Hi Petr, Thanks for taking a look! Petr Mladek <pmladek@...e.com> writes: > On Wed 2022-01-12 16:54:25, Stephen Brennan wrote: >> A CPU executing with console lock spinning enabled might be halted >> during a panic. Before console_flush_on_panic(), it's possible for >> console_trylock() to attempt optimistic spinning, deadlocking the panic >> CPU: >> >> CPU 0 (panic CPU) CPU 1 >> ----------------- ------ >> printk() { >> vprintk_func() { >> vprintk_default() { >> vprintk_emit() { >> console_unlock() { >> console_lock_spinning_enable(); >> ... printing to console ... >> panic() { >> crash_smp_send_stop() { >> NMI -------------------> HALT >> } >> atomic_notifier_call_chain() { >> printk() { >> ... >> console_trylock_spinnning() { >> // optimistic spin infinitely > > I see. > >> This hang during panic can be induced when a kdump kernel is loaded, and >> crash_kexec_post_notifiers=1 is present on the kernel command line. The >> following script which concurrently writes to /dev/kmsg, and triggers a >> panic, can result in this hang: >> >> #!/bin/bash >> date >> # 991 chars (based on log buffer size): >> chars="$(printf 'a%.0s' {1..991})" >> while :; do >> echo $chars > /dev/kmsg >> done & >> echo c > /proc/sysrq-trigger & >> date >> exit >> >> Below are the hang rates for the above test case, based on v5.16-rc8 >> before and after this patch: >> >> before: 197 hangs / 340 trials - 57.9% >> after : 0 hangs / 245 trials - 0.0% >> >> Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes") >> >> Signed-off-by: Stephen Brennan <stephen.s.brennan@...cle.com> >> Reviewed-by: Junxiao Bi <junxiao.bi@...cle.com> >> --- >> include/linux/console.h | 1 + >> kernel/panic.c | 7 +++++++ >> kernel/printk/printk.c | 17 +++++++++++++++++ >> 3 files changed, 25 insertions(+) >> >> diff --git a/include/linux/console.h b/include/linux/console.h >> index a97f277cfdfa..4eeb46927d96 100644 >> --- a/include/linux/console.h >> +++ b/include/linux/console.h >> @@ -179,6 +179,7 @@ extern void console_unlock(void); >> extern void console_conditional_schedule(void); >> extern void console_unblank(void); >> extern void console_flush_on_panic(enum con_flush_mode mode); >> +extern void console_lock_spinning_disable_on_panic(void); >> extern struct tty_driver *console_device(int *); >> extern void console_stop(struct console *); >> extern void console_start(struct console *); >> diff --git a/kernel/panic.c b/kernel/panic.c >> index cefd7d82366f..a9340e580b20 100644 >> --- a/kernel/panic.c >> +++ b/kernel/panic.c >> @@ -265,6 +265,13 @@ void panic(const char *fmt, ...) >> crash_smp_send_stop(); >> } >> >> + /* >> + * Now that we've halted other CPUs, disable optimistic spinning in >> + * printk(). This avoids deadlocking in console_trylock(), until we call >> + * console_flush_on_panic(). >> + */ >> + console_lock_spinning_disable_on_panic(); > > The proposed implementation causes that the panicing CPU takes over > console_lock even when the current owner is in the middle of > call_console_drivers(). I don't think this is true. All that console_lock_spinning_disable_on_panic() does is clear out the console_owner and console_owner_lock, signalling that the panicking CPU should not attempt to spin and wait for the lock. The only time this matters is when another task already holds console_sem and is executing presently in call_console_drivers(), as you said. With console_owner set to NULL, console_trylock_spinning() will get spin=false, and thus it will bail out, returning 0. Thus, the panicking CPU would fail to acquire console_sem, and so in vprintk_emit it will not end up going into console_unlock(), or doing call_console_drivers(). It would not be until console_flush_on_panic() down below (when there is no other option than to try to use the console) that the panicking CPU actually attempts to call the drivers. > > It means that calling console drivers from another CPU is not > completely safe. This is why console_flush_on_panic() is called > at end when the console is the only way to see the messages. > > We should not do this before kmsg_dump() and __crash_kexec() > that allow to see the messages a "more safe" way. > > I suggest to disable the spinning when panic is in progress. I mean > something like: > > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -1842,6 +1842,10 @@ static int console_trylock_spinning(void) > if (console_trylock()) > return 1; > > + /* Spinning is not safe when the system is stopped */ > + if (read_atomic(&panic_cpu) != PANIC_CPU_INVALID) > + return 0; > + > printk_safe_enter_irqsave(flags); > > raw_spin_lock(&console_owner_lock); I think this is pretty much equivalent to my fix, since it also results in the panicking CPU failing to acquire console_sem during console_trylock_spinning(). I do think this is better than my fix :) since it doesn't clutter up panic() even more, nor introduce an additional function. It even avoids resetting the console_owner_lock spinlock. I'd be happy to do this as a v2, if you'd prefer? > > It would be also great when the current owner releases console_lock > so that the panicing CPU could take over it. > > I think about the following. Well, I am not sure if it would help > in the real life because other CPUs are stopped quite early in panic(). > > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -2646,13 +2650,18 @@ void console_unlock(void) > > for (;;) { > size_t ext_len = 0; > - int handover; > + int handover, pcpu; > size_t len; > > skip: > if (!prb_read_valid(prb, console_seq, &r)) > break; > > + /* Allow the panic_cpu to take over consoles a safe way. */ > + pcpu = read_atomic(&panic_cpu); > + if (pcpu != PANIC_CPU_INVALID && pcpu != smp_processor_id()) > + break; > + > if (console_seq != r.info->seq) { > console_dropped += r.info->seq - console_seq; > console_seq = r.info->seq; > I suppose this could help reduce the odds of a CPU getting interrupted during call_console_drivers(), and it might reduce the odds of console_sem being held by a halted CPU during panic(). However, it occurs to me that there are two cases: 1) a kdump kernel is loaded. In this case, crash_smp_send_stop() is used, which (at least on x86_64) sends NMIs. In this case, __crash_kexec() will not return (barring any errors) and we won't ever really need to get the console_sem. 2) no kdump kernel is loaded. In this case, smp_send_stop() is used, which (again on x86_64) sends regular IPIs. In this case, we know we can at least avoid interrupting call_console_drivers(), since that executes with local IRQ disabled. So I'm also unsure how much this would help in the real world. But it's a small change and so it doesn't need to help much in order to be a net positive :) Thanks, Stephen > > > Note that the above code is not even compile tested. panic_cpu is > local variable in panic.c. If we go this way than I would define > some helpers: > > bool panic_in_progress(void) > { > return read_atomic(&panic_cpu) != PANIC_CPU_INVALID; > } > > bool panic_in_progress_on_other_cpu(void) > { > int pcpu = read_atomic(&panic_cpu); > > if (pcpu == PANIC_CPU_INVALID) > return false; > > if (pcpu == smp_processor_id()) > return false; > > return true; > } > > Best Regards, > Petr
Powered by blists - more mailing lists