linux-kernel - Re: [PATCH 2/4] printk: disable optimistic spin during panic


Message-ID: <YfJFjHdg/khNXiRd@google.com>
Date:   Thu, 27 Jan 2022 16:11:08 +0900
From:   Sergey Senozhatsky <senozhatsky@...omium.org>
To:     Stephen Brennan <stephen.s.brennan@...cle.com>
Cc:     Sergey Senozhatsky <senozhatsky@...omium.org>,
        John Ogness <john.ogness@...utronix.de>,
        Petr Mladek <pmladek@...e.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/4] printk: disable optimistic spin during panic

On (22/01/26 10:15), Stephen Brennan wrote:
[..]
> > On (22/01/26 10:51), John Ogness wrote:
> >> > Is there something that prevents the panic CPU from NMI-halting a
> >> > CPU which is in console_trylock() under raw_spin_lock_irqsave()?
> >> >
> >> >  CPU0				CPU1
> >> > 				console_trylock_spinning()
> >> > 				 console_trylock()
> >> > 				  down_trylock()
> >> > 				   raw_spin_lock_irqsave(&sem->lock)
> >> >
> >> >  panic()
> >> >   crash_smp_send_stop()
> >> >    NMI 			-> 		HALT
> >> 
> >> This is a good point. I wonder if console_flush_on_panic() should
> >> perform a sema_init() before it does console_trylock().
> >
> > A long time ago there was a zap_locks() function in printk that used
> > to re-init the console semaphore and the logbuf spin_lock, but _only_
> > in case of printk recursion (which was never reliable)
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/printk/printk.c?h=v4.9.297#n1557
> >
> > This has been superseded by printk_safe per-CPU buffers so we removed
> > that function.
> >
> > So maybe we want to introduce something similar to zap_locks()
> > again.
> >
> > All reasonable serial console drivers should take oops_in_progress into
> > consideration in ->write(), so we probably don't care about console_drivers
> > spinlocks, etc., but we can potentially do a bit better on the printk side.
> 
> I see the concern here. If a CPU is halted while holding
> console_sem.lock spinlock, then the very next printk would hang, since
> each vprintk_emit() does a trylock.

Right. So I also thought about placing panic_in_progress() somewhere in
console_trylock() and making it fail for any CPU that is not the panic CPU.
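
A minimal user-space sketch of that idea, not kernel code: every model_* name
is hypothetical, the console semaphore is reduced to a single atomic flag, and
CPU ids are passed in by hand. It only illustrates the "fail the trylock for
non-panic CPUs" check, under those assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for "no CPU is panicking". */
#define MODEL_CPU_INVALID (-1)

/* Id of the CPU that entered panic, or MODEL_CPU_INVALID. */
static atomic_int model_panic_cpu = MODEL_CPU_INVALID;

/* Simplified console semaphore: set means "held". */
static atomic_flag model_console_sem = ATOMIC_FLAG_INIT;

static bool model_panic_in_progress(void)
{
	return atomic_load(&model_panic_cpu) != MODEL_CPU_INVALID;
}

/* Only the first caller becomes the panic CPU. */
static void model_panic_begin(int this_cpu)
{
	int expected = MODEL_CPU_INVALID;

	atomic_compare_exchange_strong(&model_panic_cpu, &expected, this_cpu);
}

/* Returns true if this_cpu acquired the console. */
static bool model_console_trylock(int this_cpu)
{
	/* During panic, refuse the lock to every CPU but the panic CPU. */
	if (model_panic_in_progress() &&
	    atomic_load(&model_panic_cpu) != this_cpu)
		return false;

	return !atomic_flag_test_and_set(&model_console_sem);
}

static void model_console_unlock(void)
{
	atomic_flag_clear(&model_console_sem);
}
```

Note that in this model a CPU which passed the check before the panic started
can still be stopped while it holds the lock, so the check narrows the window
rather than closing it.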

> Now in my thousands of iterations of tests, I haven't been lucky enough
> to interrupt a CPU in the middle of this critical section. The critical
> section itself is incredibly short and so it's hard to do it. Not
> impossible, I'd imagine.

I can imagine that the race window is really small, and I'm not insisting
on fixing it right now (or ever for that matter).

Basically, we now have two different "something bad is in progress"
indicators that affect two different ends of the call stack. bust_spinlocks()
sets oops_in_progress and affects console drivers' spinlocks, but has
no meaning to any other printk locks. And then we have panic_in_progress()
which is meaningful to some printk locks, but not to all of them, and is
meaningless to console drivers, because those look at oops_in_progress.

If printk folks are fine with that then I'm also fine.

> We can't fix it in console_flush_on_panic(), because that is called much
> later, after we've called the panic notifiers, which definitely call
> printk(). If we wanted to re-initialize the console_sem, we'd want it
> done earlier in panic(), directly after the NMI was sent.

Right.

> My understanding was that we can't be too cautious regarding the console
> drivers. Sure, they _shouldn't_ have any race conditions, but once we're
> in panic we're better off avoiding the console drivers unless it's our
> last choice. So, is it worth re-initializing the console_sem early in
> panic, which forces all the subsequent printk to go out to the consoles?
> I don't know.
>
> One alternative is to do __printk_safe_enter() at the beginning of
> panic. This effectively guarantees that no printk will hit the console
> drivers or even attempt to grab the console_sem. Then, we can do the
> kmsg_dump, do a crash_kexec if configured, and only when all options
> have been exhausted would we reinitialize the console_sem and flush to
> the console. Maybe this is too cautious, but it is an alternative.
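
The ordering proposed in the quoted paragraph can be sketched as a small
user-space model. This is an illustration only: every model_* and STEP_* name
is hypothetical, the steps are recorded rather than executed, and a successful
crash_kexec is modelled as an early return.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_step {
	STEP_PRINTK_SAFE_ENTER,	/* stop printk from touching consoles */
	STEP_KMSG_DUMP,
	STEP_CRASH_KEXEC,
	STEP_SEM_REINIT,	/* forget any lock held by a halted CPU */
	STEP_CONSOLE_FLUSH,
};

#define MODEL_MAX_STEPS 8

static enum model_step model_trace[MODEL_MAX_STEPS];
static size_t model_trace_len;
static bool model_printk_deferred;
static int model_console_sem_count;	/* 0 = held, 1 = free */

static void model_record(enum model_step step)
{
	if (model_trace_len < MODEL_MAX_STEPS)
		model_trace[model_trace_len++] = step;
}

/* Returns true if a message would be allowed to reach console drivers. */
static bool model_printk(void)
{
	if (model_printk_deferred)
		return false;	/* message stays in the log buffer */
	return model_console_sem_count > 0;
}

static void model_panic(bool crash_kexec_configured)
{
	model_printk_deferred = true;
	model_record(STEP_PRINTK_SAFE_ENTER);

	model_record(STEP_KMSG_DUMP);

	if (crash_kexec_configured) {
		/* Modelled as not returning: the crash kernel takes over. */
		model_record(STEP_CRASH_KEXEC);
		return;
	}

	/* Last resort: reset the semaphore, then let consoles run. */
	model_console_sem_count = 1;
	model_record(STEP_SEM_REINIT);
	model_printk_deferred = false;
	model_record(STEP_CONSOLE_FLUSH);
}
```

In this model the console drivers are reached only on the path where
crash_kexec is not configured, which matches the "last choice" wording above.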

Back in the day we also had this idea of "detaching" non-panic CPUs from
printk() by overwriting their printk function pointers.