linux-kernel - Re: [PATCH] printk: CPU backtrace not printing on panic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZqjZMIxgm46qVgjL@pathway.suse.cz>
Date: Tue, 30 Jul 2024 14:14:40 +0200
From: Petr Mladek <pmladek@...e.com>
To: John Ogness <john.ogness@...utronix.de>
Cc: takakura@...inux.co.jp, rostedt@...dmis.org, senozhatsky@...omium.org,
	akpm@...ux-foundation.org, bhe@...hat.com, lukas@...ner.de,
	wangkefeng.wang@...wei.com, ubizjak@...il.com, feng.tang@...el.com,
	j.granados@...sung.com, stephen.s.brennan@...cle.com,
	linux-kernel@...r.kernel.org, nishimura@...inux.co.jp,
	taka@...inux.co.jp
Subject: Re: [PATCH] printk: CPU backtrace not printing on panic

On Fri 2024-07-26 16:02:45, John Ogness wrote:
> On 2024-07-26, Petr Mladek <pmladek@...e.com> wrote:
> > I would do it the other way and enable printing from other CPUs only
> > when triggring the backtrace. We could do it because
> > trigger_all_cpu_backtrace() waits until all backtraces are
> > printed.
> >
> > Something like:
> >
> > diff --git a/include/linux/panic.h b/include/linux/panic.h
> > index 3130e0b5116b..980bacbdfcfc 100644
> > --- a/include/linux/panic.h
> > +++ b/include/linux/panic.h
> > @@ -16,6 +16,7 @@ extern void oops_enter(void);
> >  extern void oops_exit(void);
> >  extern bool oops_may_print(void);
> >  
> > +extern int panic_triggering_all_cpu_backtrace;
> >  extern int panic_timeout;
> >  extern unsigned long panic_print;
> >  extern int panic_on_oops;
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index f861bedc1925..7e9e97d59b1e 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -64,6 +64,8 @@ unsigned long panic_on_taint;
> >  bool panic_on_taint_nousertaint = false;
> >  static unsigned int warn_limit __read_mostly;
> >  
> > +int panic_triggering_all_cpu_backtrace;
> > +
> >  int panic_timeout = CONFIG_PANIC_TIMEOUT;
> >  EXPORT_SYMBOL_GPL(panic_timeout);
> >  
> > @@ -253,8 +255,12 @@ void check_panic_on_warn(const char *origin)
> >   */
> >  static void panic_other_cpus_shutdown(bool crash_kexec)
> >  {
> > -	if (panic_print & PANIC_PRINT_ALL_CPU_BT)
> > +	if (panic_print & PANIC_PRINT_ALL_CPU_BT) {
> > +		/* Temporary allow printing messages on non-panic CPUs. */
> > +		panic_triggering_all_cpu_backtrace = true;
> >  		trigger_all_cpu_backtrace();
> > +		panic_triggering_all_cpu_backtrace = false;
> 
> Note, here we should also add
> 
> 		nbcon_atomic_flush_pending();
> 
> Your suggestion allows the other CPUs to dump their backtrace into the
> ringbuffer, but they are still forbidden from acquiring the nbcon
> console contexts for printing. That is a necessary requirement of
> nbcon_waiter_matches().

Great catch!

I would prefer to solve this in a separate patch. This problem existed
even before the commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing
to ringbuffer"). In fact, the problem existed very long time even for
the legacy consoles.

It is pity that we need to handle both consoles separately. IMHO,
we could get the same job done by calling

	console_flush_on_panic(CONSOLE_FLUSH_PENDING);

It flushes both nbcon and legacy consoles.

> Or since the cpu_sync is held while printing the backtrace, we could
> allow the non-panic CPUs to print by modifying the check in
> nbcon_context_try_acquire_direct():
> 
> ----- BEGIN -----
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index ef6e76db0f5a..cd8724840edc 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -241,7 +241,7 @@ static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
>  	struct nbcon_state new;
>  
>  	do {
> -		if (other_cpu_in_panic())
> +		if (other_cpu_in_panic() && !__printk_cpu_sync_owner())

Interesting idea. I am not completely against it.

Well, this would be the only situation when nmi_cpu_backtrace() would
be allowed to flush the messages directly. Also it would be yet
another exception.

I would probably keep it simple and just flush the messages from
the panic-CPU (using console_flush_on_panic(CONSOLE_FLUSH_PENDING).


>  			return -EPERM;
>  
>  		if (ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio)
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2316,7 +2316,7 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  	 * non-panic CPUs are generating any messages, they will be
> >  	 * silently dropped.
> >  	 */
> > -	if (other_cpu_in_panic())
> > +	if (other_cpu_in_panic() && !panic_triggering_all_cpu_backtrace)
> >  		return 0;
> 
> I wonder if it is enough to check if it is holding the cpu_sync. Then we
> would not need @panic_triggering_all_cpu_backtrace.

I prefer to keep panic_triggering_all_cpu_backtrace. I know, it is an
ugly long name. But it clearly defines what we want to achieve.
And it limits the exception to printing the backtraces.

The check of the cpu_owner would work now because it is used basically
only for the backtraces. But it might change anytime in the future.
cpu_owner is a "generic" lock. I guess that it will be used
in more situations in the future. Any change might break this
scenario again...

Summary:

I prefer two patches:

  1st patch would allow storing the backtraces using the variable
     panic_triggering_all_cpu_backtrace (better name appreciated).

  2nd patch would cause flushing the backtraces. And I would use
     console_flush_on_panic(CONSOLE_FLUSH_PENDING) as a variant
     which can be backported to stable kernels. It might later
     be updated by the upcoming printk rework.

Best Regards,
Petr