linux-kernel - Re: printk: selective deactivation of feature ignoring non panic cpu's messages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8434fytakt.fsf@jogness.linutronix.de>
Date: Fri, 28 Feb 2025 15:26:34 +0106
From: John Ogness <john.ogness@...utronix.de>
To: Petr Mladek <pmladek@...e.com>
Cc: Donghyeok Choe <d7271.choe@...sung.com>, linux-kernel@...r.kernel.org,
 takakura@...inux.co.jp, youngmin.nam@...sung.com, hajun.sung@...sung.com,
 seungh.jung@...sung.com, jh1012.choi@...sung.com
Subject: Re: printk: selective deactivation of feature ignoring non panic
 cpu's messages

On 2025-02-26, Petr Mladek <pmladek@...e.com> wrote:
> I wonder if this is actually safe. I recall that we simplified the
> design somewhere because we expected that non-panic CPUs will not
> add messages.

Perhaps you are recalling commit 7412dc6d55ee ("dump_stack: Do not get
cpu_sync for panic CPU").

> I am not sure that I found all locations. But
> we might want to revise:
>
>
> 1st problem: _prb_read_valid() skips non-finalized records on non-panic CPUs.
>
>    opinion: We should not do it in this case.

I don't know. This could result in seeing even less output if some CPU
didn't finalize a record.

> 2nd problem: Is _prb_read_valid() actually safe when
> 	panic_triggering_all_cpu_backtrace is true?
>
>    opinion: It should be safe because the backtraces from different CPUs
> 	are serialized via printk_cpu_sync_get_irqsave().

To clarify, by "safe" you mean it does not skip backtrace records from
other CPUs.

It does not skip their records because trigger_all_cpu_backtrace() waits
(up to 10 seconds) for the other CPUs to finish storing their backtraces
before continuing.

The use of the printk_cpu_sync is only to avoid interweaving the
multiple non-panic CPU backtraces.

> 3rd problem: nbcon_get_default_prio() returns NBCON_PRIO_NORMAL on
> 	non-panic CPUs. As a result, printk_get_console_flush_type()
> 	would suggest flushing like when the system works as expected.
>
> 	But the legacy-loop will bail out after flushing one
> 	message on one console, see console_flush_all(). It is weird
> 	behavior.

I believe you are talking about commit 8ebc476fd51e ("printk: Drop
console_sem during panic")? And also be aware of commit 51a1d258e50e
("printk: Keep non-panic-CPUs out of console lock").

> 	Another question is who would flush the messages when the panic()
> 	CPU does not reach the explicit flush.

Nobody. That is by design.

>    opinion: We should probably try to flush the messages on non-panic
> 	CPUs in this mode when safe. This is why I prefer the name
> 	"printk_debug_non_panic_cpus".
>
> 	We should update console_flush_all() to do not bail out when
> 	the new option is set.

It is not that simple. Legacy printing involves acquiring the
console_lock, which currently is not possible for non-panic CPUs during
panic.

> 	We should call nbcon_atomic_flush_pending() on non-panic CPUs
> 	when the new option is set. printk_get_console_flush_type()
> 	should behave like with NBCON_PRIO_EMERGENCY.

Note that non-panic CPUs during panic are also forbidden from acquiring
console ownership.

> 	Maybe, nbcon_get_default_prio() should actually return
> 	NBCON_PRIO_EMERGENCY on non-panic CPUs when this option is set.
> 	It allow the non-panic CPUs to take over the nbcon context
> 	from the potentially frozen kthread.

Note that nbcon_waiter_matches() requires that "Only one CPU is allowed
to request PANIC priority."

IMO it is opening up a huge can of worms to start allowing non-panic
CPUs to acquire the console_lock, take over console ownership, and print
(possibly with PANIC priority).

You could argue that this is just a debug mode. But I don't think that
is justification to allow things to explode and possibly deadlock, when
they otherwise would not have.

Perhaps Donghyeok might be happy enough if the debug option simply
allowed the non-panic CPUs to store records. There are plenty of tools
available to get at the dmesg buffer.

John