[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190723031340.GA19463@jagdpanzerIV>
Date: Tue, 23 Jul 2019 12:13:40 +0900
From: Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
To: Petr Mladek <pmladek@...e.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
Konstantin Khlebnikov <koct9i@...il.com>,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
John Ogness <john.ogness@...utronix.de>,
Thomas Gleixner <tglx@...utronix.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Petr Tesarik <ptesarik@...e.cz>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] printk/panic: Access the main printk log in panic()
only when safe
On (07/19/19 14:57), Petr Mladek wrote:
[..]
> > Where do nested printk()-s come from? Which one of the following
> > scenarios you cover in commit message:
> >
> > scenario 1
> >
> > - we have CPUB which holds logbuf_lock
> > - we have CPUA which panic()-s the system, but can't bring CPUB down,
> > so logbuf_lock stays locked on remote CPU
>
> No, this scenario is not affected by this patch. It would always lead to
> a deadlock.
Agreed, in many cases we won't be able to panic() the system properly,
deadlocking somewhere in smp_send_stop().
> > scenario 2
> >
> > - we have CPUA which holds logbuf_lock
> > - we have panic() on CPUA, but it cannot bring down some other CPUB
> > so logbuf_lock stays locked on local CPU, and it cannot re-init
> > logbuf.
[..]
> + Before:
> + printk_safe_flush_on_panic() will keep logbuf_lock locked
> and do nothing.
>
> + kmsg_dump(), console_unblank(), or console_flush_on_panic()
> will deadlock when they try to get logbuf_lock(). They will
> not be able to process any single line.
>
> + After:
> + printk_bust_lock_safe() will keep logbuf_lock locked
>
> + All functions using logbuf_lock will not get called.
> We will not see the messages (as previously) but the
> system will not deadlock.
>
>
> But there is one more scenario 3:
Yes!
> - we have CPUB which loops or is deadlocked in IRQ context
>
> - we have CPUA which panic()-s the system, but can't bring CPUB down,
> so logbuf_lock might be takes and release from time to time
> by CPUB
Great!
This is the only case when we actually need to pay attention to
num_online_cpus(), because there is an active logbuf_lock owner;
in any other case we can unconditionally re-init printk() locks.
But there is more to it.
Note, that the problem in scenario 3 is bigger than just logbuf_lock.
Regardless of logbuf implementation we will not be able to panic()
the system.
If we have a never ending source of printk() messages, coming from
misbehaving CPU which stuck in printing loop in IRQ context, then
flush_on_panic() will never end or kmsg dump will never stop, etc.
We need to cut off misbehaving CPUs. Panic CPU waits (for up to 1
second?) in smp_send_stop() for secondary CPUs to die, if some
secondary CPUs are still chatty after that then most likely those
CPUs don't have anything good to say, just a pointless flood of same
messages over and over again; which, however, will not let panic
CPU to proceed.
And this is where the idea of "disconnecting" those CPUs from main
logbuf come from.
So what we can do:
- smp_send_stop()
- disconnect all-but-self from logbuf (via printk-safe)
- wait for 1 or 2 more extra seconds for secondary CPUs to leave
console_unlock() and to redirect printks to per-CPU buffers
- after that we are sort of good-to-go: re-init printk locks
and do kmsg_dump, flush_on_panic().
Note, misbehaving CPUs will write to per-CPU buffers, they are not
expected to be able to flush per-CPU buffers to the main logbuf. That
will require enabled IRQs, which should deliver stop IPI. But we can
do even more - just disable print_safe irq work on disconnect CPUs.
So, shall we try one more time with the "disconnect" misbehaving CPUs
approach? I can send an RFC patch.
-ss
Powered by blists - more mailing lists