linux-kernel - RE: Problem with nbcon console and amba-pl011 serial port

Message-ID: <84plfl5bf1.fsf@jogness.linutronix.de>
Date: Tue, 03 Jun 2025 13:15:38 +0206
From: John Ogness <john.ogness@...utronix.de>
To: pmladek@...e.com
Cc: "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com>,
	'Michael Kelley' <mhklinux@...look.com>,
	'Ryo Takakura' <ryotkkr98@...il.com>,
	Russell King <linux@...linux.org.uk>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Jiri Slaby <jirislaby@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-serial@...r.kernel.org" <linux-serial@...r.kernel.org>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>
Subject: RE: Problem with nbcon console and amba-pl011 serial port

Hi Petr,

On 2025-06-03, John Ogness <john.ogness@...utronix.de> wrote:
> On 2025-06-03, "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com> wrote:
>>> 4. pr_emerg() has a high logging level, and it effectively steals the console
>>> from the "pr/ttyAMA0" task, which I believe is intentional in the nbcon design.
>>> Down in pl011_console_write_thread(), the "pr/ttyAMA0" task is doing
>>> nbcon_enter_unsafe() and nbcon_exit_unsafe() around each character
>>> that it outputs.  When pr_emerg() steals the console, nbcon_exit_unsafe()
>>> returns 0, so the "for" loop exits. pl011_console_write_thread() then
>>> enters a busy "while" loop waiting to reclaim the console. It's doing this
>>> busy "while" loop with interrupts disabled, and because of the panic,
>>> it never succeeds. Whatever CPU is running "pr/ttyAMA0" is effectively
>>> stuck at this point.
>>> 
>>> 5. Meanwhile panic() continues, calling panic_other_cpus_shutdown(). On
>>> ARM64, other CPUs are stopped by sending them an IPI. Each CPU receives
>>> the IPI and calls the PSCI function to stop itself. But the CPU running
>>> "pr/ttyAMA0" is looping forever with interrupts disabled, so it never
>>> processes the IPI and it never stops. ARM64 doesn't have a true NMI that
>>> can override the looping with interrupts disabled, so there's no way to
>>> stop that CPU.
>>> 
>>> 6. The failure to stop the "pr/ttyAMA0" CPU then causes downstream
>>> problems, such as when loading and running a kdump kernel.
>
> [...]
>
>> After reproducing the issue, 
>> I plan to try a workaround that forcibly terminates the nbcon_reacquire_nobuf
>> loop in pl011_console_write_thread if other_cpu_in_panic is true.
>> Please comment if you have any other ideas.
>
> For panic, if it is OK to leave uap->clk enabled and not restore REG_CR,
> then it should be fine to just return. But only for panic.
>
> So something like:
>
> 	while (!nbcon_enter_unsafe(wctxt)) {
> 		if (other_cpu_in_panic())
> 			return;
> 		nbcon_reacquire_nobuf(wctxt);
> 	}

Actually, this is not enough: nbcon_reacquire_nobuf() contains its own
loop, so in the panic case it never returns and the check in the caller
is never reached.

nbcon_reacquire_nobuf() needs to report failure in the panic case, since
that is the only case where reacquiring can never succeed. Should it
return a bool, or an error code such as -EPERM?

So the above code becomes:

 	while (!nbcon_enter_unsafe(wctxt)) {
 		if (!nbcon_reacquire_nobuf(wctxt))
 			return;
 	}

We should also add __must_check to the prototype.
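
To make the intended control flow concrete, here is a minimal userspace
sketch of the proposal. It is not the kernel implementation: the struct
layout, the "owned" flag, the mock_panic_on_other_cpu variable, and the
write_thread_tail() and simulate() helpers are all hypothetical stand-ins,
and __must_check is assumed to map to the compiler's warn_unused_result
attribute. It assumes the bool variant, where false means "a panic is in
progress on another CPU, give up".

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's __must_check annotation (assumption). */
#ifndef __must_check
#define __must_check __attribute__((warn_unused_result))
#endif

/* Hypothetical mock: the real struct nbcon_write_context looks different. */
struct nbcon_write_context {
	bool owned;	/* mock: does this context currently own the console? */
};

/* Mock state standing in for the kernel's panic tracking. */
static bool mock_panic_on_other_cpu;

static bool other_cpu_in_panic(void)
{
	return mock_panic_on_other_cpu;
}

/* Mock: entering the unsafe section only succeeds while owning the console. */
static bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt)
{
	return wctxt->owned;
}

/*
 * Sketch of the proposed semantics: keep trying to reacquire, but return
 * false when another CPU is in panic, because then it can never succeed.
 */
static __must_check bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
{
	while (!wctxt->owned) {
		if (other_cpu_in_panic())
			return false;
		/* Mock: without a panic, the acquisition eventually succeeds. */
		wctxt->owned = true;
	}
	return true;
}

/*
 * Hypothetical tail of a write_thread() callback after ownership was lost.
 * Returns true if the cleanup step (restoring registers, etc.) was reached.
 */
static bool write_thread_tail(struct nbcon_write_context *wctxt)
{
	while (!nbcon_enter_unsafe(wctxt)) {
		if (!nbcon_reacquire_nobuf(wctxt))
			return false;	/* panic: bail out without cleanup */
	}
	/* A real driver would restore its hardware state here. */
	return true;
}

/* Run the sketch starting from "console was stolen" (owned == false). */
bool simulate(bool panic)
{
	struct nbcon_write_context wctxt = { .owned = false };

	mock_panic_on_other_cpu = panic;
	return write_thread_tail(&wctxt);
}
```

In this mock, simulate(true) returns false without spinning, while
simulate(false) reacquires and reaches the cleanup step. With the
attribute in place, a caller that ignores the return value gets a
compiler warning, which is the point of adding __must_check.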

Thoughts?

John