linux-kernel - Re: [PATCH] printk: CPU backtrace not printing on panic

Message-Id: <20240801082721.62935-1-takakura@valinux.co.jp>
Date: Thu,  1 Aug 2024 17:27:21 +0900
From: takakura@...inux.co.jp
To: pmladek@...e.com,
	john.ogness@...utronix.de
Cc: akpm@...ux-foundation.org,
	bhe@...hat.com,
	feng.tang@...el.com,
	j.granados@...sung.com,
	linux-kernel@...r.kernel.org,
	lukas@...ner.de,
	nishimura@...inux.co.jp,
	rostedt@...dmis.org,
	senozhatsky@...omium.org,
	stephen.s.brennan@...cle.com,
	taka@...inux.co.jp,
	takakura@...inux.co.jp,
	ubizjak@...il.com,
	wangkefeng.wang@...wei.com
Subject: Re: [PATCH] printk: CPU backtrace not printing on panic

Hi Petr and John,

On 2024-07-30, Petr Mladek <pmladek@...e.com> wrote:
>On Fri 2024-07-26 16:02:45, John Ogness wrote:
>> On 2024-07-26, Petr Mladek <pmladek@...e.com> wrote:
>> > I would do it the other way and enable printing from other CPUs only
>> > when triggering the backtrace. We could do it because
>> > trigger_all_cpu_backtrace() waits until all backtraces are
>> > printed.
>> >
>> > Something like:
>> >
>> > diff --git a/include/linux/panic.h b/include/linux/panic.h
>> > index 3130e0b5116b..980bacbdfcfc 100644
>> > --- a/include/linux/panic.h
>> > +++ b/include/linux/panic.h
>> > @@ -16,6 +16,7 @@ extern void oops_enter(void);
>> >  extern void oops_exit(void);
>> >  extern bool oops_may_print(void);
>> >  
>> > +extern int panic_triggering_all_cpu_backtrace;
>> >  extern int panic_timeout;
>> >  extern unsigned long panic_print;
>> >  extern int panic_on_oops;
>> > diff --git a/kernel/panic.c b/kernel/panic.c
>> > index f861bedc1925..7e9e97d59b1e 100644
>> > --- a/kernel/panic.c
>> > +++ b/kernel/panic.c
>> > @@ -64,6 +64,8 @@ unsigned long panic_on_taint;
>> >  bool panic_on_taint_nousertaint = false;
>> >  static unsigned int warn_limit __read_mostly;
>> >  
>> > +int panic_triggering_all_cpu_backtrace;
>> > +
>> >  int panic_timeout = CONFIG_PANIC_TIMEOUT;
>> >  EXPORT_SYMBOL_GPL(panic_timeout);
>> >  
>> > @@ -253,8 +255,12 @@ void check_panic_on_warn(const char *origin)
>> >   */
>> >  static void panic_other_cpus_shutdown(bool crash_kexec)
>> >  {
>> > -	if (panic_print & PANIC_PRINT_ALL_CPU_BT)
>> > +	if (panic_print & PANIC_PRINT_ALL_CPU_BT) {
>> > +		/* Temporarily allow printing messages on non-panic CPUs. */
>> > +		panic_triggering_all_cpu_backtrace = true;
>> >  		trigger_all_cpu_backtrace();
>> > +		panic_triggering_all_cpu_backtrace = false;
>> 
>> Note, here we should also add
>> 
>> 		nbcon_atomic_flush_pending();
>> 
>> Your suggestion allows the other CPUs to dump their backtrace into the
>> ringbuffer, but they are still forbidden from acquiring the nbcon
>> console contexts for printing. That is a necessary requirement of
>> nbcon_waiter_matches().
>
>Great catch!
>
>I would prefer to solve this in a separate patch. This problem existed
>even before the commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing
>to ringbuffer"). In fact, the problem existed for a very long time, even for
>the legacy consoles.
>

Good point! I guess the problem has existed since commit 51a1d258e50e
("printk: Keep non-panic-CPUs out of console lock"), as that commit forbade
non-panic CPUs from acquiring the console lock?

>It is a pity that we need to handle both consoles separately. IMHO,
>we could get the same job done by calling
>
>	console_flush_on_panic(CONSOLE_FLUSH_PENDING);
>
>It flushes both nbcon and legacy consoles.
>
>> Or since the cpu_sync is held while printing the backtrace, we could
>> allow the non-panic CPUs to print by modifying the check in
>> nbcon_context_try_acquire_direct():
>> 
>> ----- BEGIN -----
>> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
>> index ef6e76db0f5a..cd8724840edc 100644
>> --- a/kernel/printk/nbcon.c
>> +++ b/kernel/printk/nbcon.c
>> @@ -241,7 +241,7 @@ static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
>>  	struct nbcon_state new;
>>  
>>  	do {
>> -		if (other_cpu_in_panic())
>> +		if (other_cpu_in_panic() && !__printk_cpu_sync_owner())
>
>Interesting idea. I am not completely against it.
>
>Well, this would be the only situation when nmi_cpu_backtrace() would
>be allowed to flush the messages directly. Also it would be yet
>another exception.
>
>I would probably keep it simple and just flush the messages from
>the panic-CPU (using console_flush_on_panic(CONSOLE_FLUSH_PENDING)).
>
>
>>  			return -EPERM;
>>  
>>  		if (ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio)
>> > --- a/kernel/printk/printk.c
>> > +++ b/kernel/printk/printk.c
>> > @@ -2316,7 +2316,7 @@ asmlinkage int vprintk_emit(int facility, int level,
>> >  	 * non-panic CPUs are generating any messages, they will be
>> >  	 * silently dropped.
>> >  	 */
>> > -	if (other_cpu_in_panic())
>> > +	if (other_cpu_in_panic() && !panic_triggering_all_cpu_backtrace)
>> >  		return 0;
>> 
>> I wonder if it is enough to check if it is holding the cpu_sync. Then we
>> would not need @panic_triggering_all_cpu_backtrace.
>
>I prefer to keep panic_triggering_all_cpu_backtrace. I know, it is an
>ugly long name. But it clearly defines what we want to achieve.
>And it limits the exception to printing the backtraces.
>
>The check of the cpu_owner would work now because it is used basically
>only for the backtraces. But it might change anytime in the future.
>cpu_owner is a "generic" lock. I guess that it will be used
>in more situations in the future. Any change might break this
>scenario again...
>

I agree that checking cpu_owner may become insufficient in the future, and
that using panic_triggering_all_cpu_backtrace is more reliable in that sense.
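
To double-check my understanding of the intended behavior, here is a small
userspace model of the vprintk_emit() check. It is only a sketch and not
kernel code: every sim_* name is made up for illustration, and the real
other_cpu_in_panic() takes no CPU argument. It only shows that messages from
non-panic CPUs are dropped except while the flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only; every sim_* name is hypothetical. */
#define SIM_NO_CPU (-1)

static int  sim_panic_cpu = SIM_NO_CPU;	/* CPU that called panic(), if any */
static bool sim_panic_triggering_all_cpu_backtrace; /* models the proposed flag */
static int  sim_stored;			/* messages accepted by the model */

/* Models other_cpu_in_panic(): true if a CPU other than @cpu has panicked. */
static bool sim_other_cpu_in_panic(int cpu)
{
	return sim_panic_cpu != SIM_NO_CPU && sim_panic_cpu != cpu;
}

/* Models the check in vprintk_emit(): returns 1 if stored, 0 if dropped. */
static int sim_vprintk_emit(int cpu, const char *msg)
{
	assert(cpu >= 0);

	if (sim_other_cpu_in_panic(cpu) &&
	    !sim_panic_triggering_all_cpu_backtrace)
		return 0;

	printf("[cpu%d] %s\n", cpu, msg);
	sim_stored++;
	return 1;
}

/* Models the window opened around trigger_all_cpu_backtrace(). */
static int sim_backtrace_from(int other_cpu)
{
	int stored;

	sim_panic_triggering_all_cpu_backtrace = true;
	stored = sim_vprintk_emit(other_cpu, "backtrace");
	sim_panic_triggering_all_cpu_backtrace = false;
	return stored;
}
```

With sim_panic_cpu set to 0, a message from CPU 1 is dropped before and after
sim_backtrace_from(1), while the backtrace itself is stored. The exception
stays limited to the backtrace window, which is the point of keeping the
dedicated variable.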

>Summary:
>
>I prefer two patches:
>
>  1st patch would allow storing the backtraces using the variable
>     panic_triggering_all_cpu_backtrace (better name appreciated).
>
>  2nd patch would cause flushing the backtraces. And I would use
>     console_flush_on_panic(CONSOLE_FLUSH_PENDING) as a variant
>     which can be backported to stable kernels. It might later
>     be updated by the upcoming printk rework.
>
>Best Regards,
>Petr

Thanks! I'll prepare another patch based on it!
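
For the second patch, my rough understanding is the following userspace model
(again only a sketch, not kernel code; all toy_* names are hypothetical and
the "ringbuffer" is a plain array). Non-panic CPUs may store their backtraces
but are refused when they try to print, so the panic CPU flushes the pending
records afterwards, which is the role console_flush_on_panic(CONSOLE_FLUSH_PENDING)
would play.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace model only; every toy_* name is hypothetical. */
#define TOY_RB_SIZE 8

static const char *toy_rb[TOY_RB_SIZE];	/* stand-in for the printk ringbuffer */
static int toy_rb_head;			/* next free slot */
static int toy_rb_flushed;		/* first record not yet on the console */
static int toy_panic_cpu;		/* assumption: CPU 0 is the panic CPU */

/* Store a record; returns false when the toy buffer is full. */
static bool toy_store(const char *msg)
{
	if (toy_rb_head >= TOY_RB_SIZE)
		return false;
	toy_rb[toy_rb_head++] = msg;
	return true;
}

/*
 * Flush pending records to the "console". Only the panic CPU may do so.
 * Returns the number of records written, or -1 if @cpu is refused
 * (standing in for the -EPERM seen by non-panic CPUs).
 */
static int toy_flush_pending(int cpu)
{
	int written = 0;

	assert(cpu >= 0);

	if (cpu != toy_panic_cpu)
		return -1;

	while (toy_rb_flushed < toy_rb_head) {
		printf("console: %s\n", toy_rb[toy_rb_flushed++]);
		written++;
	}
	return written;
}
```

In this model, two stored backtraces stay pending when CPU 1 tries to flush,
and both reach the console once CPU 0 flushes.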

Sincerely,
Ryo Takakura