lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZqD3fb8mQqWwePEU@pathway.suse.cz>
Date: Wed, 24 Jul 2024 14:45:49 +0200
From: Petr Mladek <pmladek@...e.com>
To: John Ogness <john.ogness@...utronix.de>
Cc: Rik van Riel <riel@...riel.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Omar Sandoval <osandov@...a.com>, linux-kernel@...r.kernel.org,
	Steven Rostedt <rostedt@...dmis.org>,
	Sergey Senozhatsky <senozhatsky@...omium.org>
Subject: Re: [RFC PATCH] nmi,printk: fix ABBA deadlock between nmi_backtrace
 and dump_stack_lvl

On Wed 2024-07-17 09:22:21, John Ogness wrote:
> On 2024-07-15, Rik van Riel <riel@...riel.com> wrote:
> > Both nmi_backtrace and dump_stack_lvl call printk_cpu_sync_get_irqsave.
> >
> > However, dump_stack_lvl will call into the printk code, while holding
> > the printk_cpu_sync_get lock, and then take the console lock.
> >
> > Another CPU may end up getting an NMI stack trace printed, after
> > being stuck printing something to serial console for too long,
> > with the console lock held.
> >
> > This results in the following lock order:
> > CPU A: printk_cpu_sync_get lock -> console_lock
> > CPU B: console_lock -> (nmi) printk_cpu_sync_get lock
> >
> > This will cause the system to hang with an ABBA deadlock
> 
> The console lock is acquired via trylock, so that will not yield
> deadlock here. However, if CPU B was printing, then CPU A will spin on
> @console_waiter (in console_trylock_spinning()). _That_ is a deadlock.
>
> The purpose of printk_cpu_sync_get_irqsave() is to avoid having the
> different backtraces being interleaved in the _ringbuffer_. It really
> isn't necessary that they are printed in that context. And indeed, there
> is no guarantee that they will be printed in that context anyway.

Well, backtraces are printed when unexpected things happen. It is
usually an emergency situation and we should do our best to
flush consoles ASAP.

> Perhaps a simple solution would be for printk_cpu_sync_get_irqsave() to
> call printk_deferred_enter/_exit. Something like the below patch.
> 
> John Ogness
> 
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 65c5184470f1..1a6f5aac28bf 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -315,8 +315,10 @@ extern void __printk_cpu_sync_put(void);
>  #define printk_cpu_sync_get_irqsave(flags)		\
>  	for (;;) {					\
>  		local_irq_save(flags);			\
> +		printk_deferred_enter();		\
>  		if (__printk_cpu_sync_try_get())	\
>  			break;				\
> +		printk_deferred_exit();			\
>  		local_irq_restore(flags);		\
>  		__printk_cpu_sync_wait();		\
>  	}
> @@ -329,6 +331,7 @@ extern void __printk_cpu_sync_put(void);
>  #define printk_cpu_sync_put_irqrestore(flags)	\
>  	do {					\
>  		__printk_cpu_sync_put();	\
> +		printk_deferred_exit();		\
>  		local_irq_restore(flags);	\
>  	} while (0)
>  

OK, there is the basic rule: "Never take a lock in NMI when the lock might
be taken in another context". printk_cpu_sync_get() violates the rule.
But it is safe as long as the lock is re-entrant and tail.

The above patch fixes a situation where printk_cpu_sync_get() was not
a tail lock. So it looks like a reasonable fix if we want to keep it simple.

Well, I still prefer the alternative solution at
https://lore.kernel.org/r/93155b2ccafa43ed4845ae450451c6b8e27509cc.camel@surriel.com
which does not deffer the console output in normal/IRQ context. There
are doubts whether it is safe. I think that it is. Let me to reply there.

Best Regards,
Petr

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ