linux-kernel - Re: [PATCH printk v2 2/5] printk: Add NMI safety to console_flush_on_panic() and console

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZLANiK_1YoBu1YpU@alley>
Date:   Thu, 13 Jul 2023 16:43:20 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     John Ogness <john.ogness@...utronix.de>
Cc:     Sergey Senozhatsky <senozhatsky@...omium.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH printk v2 2/5] printk: Add NMI safety to
 console_flush_on_panic() and console_unblank()

On Wed 2023-07-12 23:17:49, John Ogness wrote:
> On 2023-07-11, Petr Mladek <pmladek@...e.com> wrote:
> > Just to be sure. The semaphore is not NMI safe because even the
> > trylock takes an internal spin lock. Am I right, please?
> 
> Yes, that is one of the reasons. Sergey mentioned another (waking a task
> on up()).

I see.

> > Alternative solution would be to make down_trylock() NMI safe
> > by using raw_spin_trylock_irqsave() for the internal lock.
> 
> NMI contexts are only allowed to take raw spinlocks if those spinlocks
> are only used from NMI context. Otherwise you could have deadlock:
> 
> raw_spin_lock()
> --- NMI ---
> raw_spin_lock()
> 
> Using a trylock does not avoid the deadlock danger.
> 
> > Another question is whether we want to call c->unblank()
> > in NMI even when down_trylock() was NMI safe. It seems that it
> > is implemented only for struct console vt_console_driver.
> > I am pretty sure that it takes more internal locks which
> > are not NMI safe either.
> 
> Yes, it does. As an example, it calls mod_timer(), which is also not NMI
> safe. Clearly the unblank() callback must not be called in NMI context.
> 
> > Finally, it is not only about NMI. Any locks might cause a deadlock
> > in panic() in any context. It is because other CPUs are stopped
> > and might block some locks.
> 
> With the atomic/threaded model this is not true. The port ownership can
> be safely taken over from stopped CPUs.

Right. But it would mean using the special lock also in c->unblank()
code. And it is only tty console which is one of the most complicated
consoles.

> > In my opinion, we should handle c->unblank() in panic() the same way
> > as c->write() in panic().
> 
> I do not agree. Clearly unblank() is not NMI safe. Also, in current
> mainline code, console_unblank() will already give up if the trylock
> failed (rather than ignoring the lock, like write() does). So
> console_unblank() might as well also give up if in NMI context.

You are right.

OK, could we at least improve the commit message, please?

Something like:

<proposal>
console_sem() is not NMI safe even when using down_trylock(). Both
down_trylock() and up() are using an interal spinlock. up() might even
call wake_up_process() with another locks.

It is even worse in the panic() code path where the locks might be blocked
by stopped CPUs.

The sepaphore is used in two code paths, in console_unblank() and when
flushing console messages. On the low level, it is needed when calling
c->unblank() and c->write() callbacks.

Both code paths are not safe in panic() but they are handled differently.
c->unblank() is called only when console_trylock() succeeded. c->write()
is called in console_flush_on_panic() even when console_trylock().
The risk of a deadlock in c->write() callbacks is reduced by using trylock()
for the internal locks when oops_in_progess variable is set.

Reduce the risk of deadlocks caused the console semapthore by:

  + bailing out from console_unblank() in NMI
  + not taking the console_sem() in console_flush_on_panic() at all

Simple removal of console_trylock() in console_flush_on_panic() would
cause that other CPUs might still be able to take it and race.
The problem is avoided by checking panic_in_progress() in console_lock()
and console_trylock(). They will never succeed on non-panic CPUs.

The change is a preparation step for introducing printk kthreads and
atomic console write callbacks. It would make the panic() code path
completely safe for consoles without c->unblank() callback.
</proposal>

Wait, the last paragraph is not true. console_trylock() is still
called in console_unblank() in non-NMI context. But the lock
might be blocked by a stopped CPU.

It can be solved by checking whether there is any registered console
with c->unblank() callback first. As a result, console_trylock()
would be called only when a tty console is registered. The panic() path
really might be completely safe where only safe consoles are
registered.

It would make sense to do separate patch for console_unblank()
and console_flush_on_panic().

Of course, we might also improve console_unblank() later. But I
would still like to improve the commit message. You know, the original
commit title and message is talking about NMI. But the patch
has effects even in non-NMI context.

Best Regards,
Petr