[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20240801082721.62935-1-takakura@valinux.co.jp>
Date: Thu, 1 Aug 2024 17:27:21 +0900
From: takakura@...inux.co.jp
To: pmladek@...e.com,
john.ogness@...utronix.de
Cc: akpm@...ux-foundation.org,
bhe@...hat.com,
feng.tang@...el.com,
j.granados@...sung.com,
linux-kernel@...r.kernel.org,
lukas@...ner.de,
nishimura@...inux.co.jp,
rostedt@...dmis.org,
senozhatsky@...omium.org,
stephen.s.brennan@...cle.com,
taka@...inux.co.jp,
takakura@...inux.co.jp,
ubizjak@...il.com,
wangkefeng.wang@...wei.com
Subject: Re: [PATCH] printk: CPU backtrace not printing on panic
Hi Petr and John,
On 2024-07-30, Petr Mladek <pmladek@...e.com> wrote:
>On Fri 2024-07-26 16:02:45, John Ogness wrote:
>> On 2024-07-26, Petr Mladek <pmladek@...e.com> wrote:
>> > I would do it the other way and enable printing from other CPUs only
>> > when triggring the backtrace. We could do it because
>> > trigger_all_cpu_backtrace() waits until all backtraces are
>> > printed.
>> >
>> > Something like:
>> >
>> > diff --git a/include/linux/panic.h b/include/linux/panic.h
>> > index 3130e0b5116b..980bacbdfcfc 100644
>> > --- a/include/linux/panic.h
>> > +++ b/include/linux/panic.h
>> > @@ -16,6 +16,7 @@ extern void oops_enter(void);
>> > extern void oops_exit(void);
>> > extern bool oops_may_print(void);
>> >
>> > +extern int panic_triggering_all_cpu_backtrace;
>> > extern int panic_timeout;
>> > extern unsigned long panic_print;
>> > extern int panic_on_oops;
>> > diff --git a/kernel/panic.c b/kernel/panic.c
>> > index f861bedc1925..7e9e97d59b1e 100644
>> > --- a/kernel/panic.c
>> > +++ b/kernel/panic.c
>> > @@ -64,6 +64,8 @@ unsigned long panic_on_taint;
>> > bool panic_on_taint_nousertaint = false;
>> > static unsigned int warn_limit __read_mostly;
>> >
>> > +int panic_triggering_all_cpu_backtrace;
>> > +
>> > int panic_timeout = CONFIG_PANIC_TIMEOUT;
>> > EXPORT_SYMBOL_GPL(panic_timeout);
>> >
>> > @@ -253,8 +255,12 @@ void check_panic_on_warn(const char *origin)
>> > */
>> > static void panic_other_cpus_shutdown(bool crash_kexec)
>> > {
>> > - if (panic_print & PANIC_PRINT_ALL_CPU_BT)
>> > + if (panic_print & PANIC_PRINT_ALL_CPU_BT) {
>> > + /* Temporary allow printing messages on non-panic CPUs. */
>> > + panic_triggering_all_cpu_backtrace = true;
>> > trigger_all_cpu_backtrace();
>> > + panic_triggering_all_cpu_backtrace = false;
>>
>> Note, here we should also add
>>
>> nbcon_atomic_flush_pending();
>>
>> Your suggestion allows the other CPUs to dump their backtrace into the
>> ringbuffer, but they are still forbidden from acquiring the nbcon
>> console contexts for printing. That is a necessary requirement of
>> nbcon_waiter_matches().
>
>Great catch!
>
>I would prefer to solve this in a separate patch. This problem existed
>even before the commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing
>to ringbuffer"). In fact, the problem existed very long time even for
>the legacy consoles.
>
Good point! I guess the problem existed since the commit 51a1d258e50e
("printk: Keep non-panic-CPUs out of console lock") as it forbade the
acquisition of console lock for non-panic cpus?
>It is pity that we need to handle both consoles separately. IMHO,
>we could get the same job done by calling
>
> console_flush_on_panic(CONSOLE_FLUSH_PENDING);
>
>It flushes both nbcon and legacy consoles.
>
>> Or since the cpu_sync is held while printing the backtrace, we could
>> allow the non-panic CPUs to print by modifying the check in
>> nbcon_context_try_acquire_direct():
>>
>> ----- BEGIN -----
>> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
>> index ef6e76db0f5a..cd8724840edc 100644
>> --- a/kernel/printk/nbcon.c
>> +++ b/kernel/printk/nbcon.c
>> @@ -241,7 +241,7 @@ static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
>> struct nbcon_state new;
>>
>> do {
>> - if (other_cpu_in_panic())
>> + if (other_cpu_in_panic() && !__printk_cpu_sync_owner())
>
>Interesting idea. I am not completely against it.
>
>Well, this would be the only situation when nmi_cpu_backtrace() would
>be allowed to flush the messages directly. Also it would be yet
>another exception.
>
>I would probably keep it simple and just flush the messages from
>the panic-CPU (using console_flush_on_panic(CONSOLE_FLUSH_PENDING).
>
>
>> return -EPERM;
>>
>> if (ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio)
>> > --- a/kernel/printk/printk.c
>> > +++ b/kernel/printk/printk.c
>> > @@ -2316,7 +2316,7 @@ asmlinkage int vprintk_emit(int facility, int level,
>> > * non-panic CPUs are generating any messages, they will be
>> > * silently dropped.
>> > */
>> > - if (other_cpu_in_panic())
>> > + if (other_cpu_in_panic() && !panic_triggering_all_cpu_backtrace)
>> > return 0;
>>
>> I wonder if it is enough to check if it is holding the cpu_sync. Then we
>> would not need @panic_triggering_all_cpu_backtrace.
>
>I prefer to keep panic_triggering_all_cpu_backtrace. I know, it is an
>ugly long name. But it clearly defines what we want to achieve.
>And it limits the exception to printing the backtraces.
>
>The check of the cpu_owner would work now because it is used basically
>only for the backtraces. But it might change anytime in the future.
>cpu_owner is a "generic" lock. I guess that it will be used
>in more situations in the future. Any change might break this
>scenario again...
>
I agree that the checking of cpu_owner can be insufficient in the future and
the use of panic_triggering_all_cpu_backtrace is more reliable in that sense.
>Summary:
>
>I prefer two patches:
>
> 1st patch would allow storing the backtraces using the variable
> panic_triggering_all_cpu_backtrace (better name appreciated).
>
> 2nd patch would cause flushing the backtraces. And I would use
> console_flush_on_panic(CONSOLE_FLUSH_PENDING) as a variant
> which can be backported to stable kernels. It might later
> be updated by the upcoming printk rework.
>
>Best Regards,
>Petr
Thanks! I'll prepare another patch based on it!
Sincerely,
Ryo Takakura
Powered by blists - more mailing lists