linux-kernel - RE: Problem with nbcon console and amba-pl011 serial port

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <TY4PR01MB13777674C22721FCD8ACF4FCCD76CA@TY4PR01MB13777.jpnprd01.prod.outlook.com>
Date: Wed, 4 Jun 2025 04:11:10 +0000
From: "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com>
To: 'Michael Kelley' <mhklinux@...look.com>, 'John Ogness'
	<john.ogness@...utronix.de>
CC: "pmladek@...e.com" <pmladek@...e.com>, 'Ryo Takakura'
	<ryotkkr98@...il.com>, Russell King <linux@...linux.org.uk>, Greg
 Kroah-Hartman <gregkh@...uxfoundation.org>, Jiri Slaby
	<jirislaby@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-serial@...r.kernel.org"
	<linux-serial@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "Toshiyuki Sato (Fujitsu)"
	<fj6611ie@...itsu.com>
Subject: RE: Problem with nbcon console and amba-pl011 serial port

Hi Michael, John,

> Hi Petr,
> 
> On 2025-06-03, John Ogness <john.ogness@...utronix.de> wrote:
> > On 2025-06-03, "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com> wrote:
> >>> 4. pr_emerg() has a high logging level, and it effectively steals the console
> >>> from the "pr/ttyAMA0" task, which I believe is intentional in the nbcon
> design.
> >>> Down in pl011_console_write_thread(), the "pr/ttyAMA0" task is doing
> >>> nbcon_enter_unsafe() and nbcon_exit_unsafe() around each character
> >>> that it outputs.  When pr_emerg() steals the console, nbcon_exit_unsafe()
> >>> returns 0, so the "for" loop exits. pl011_console_write_thread() then
> >>> enters a busy "while" loop waiting to reclaim the console. It's doing this
> >>> busy "while" loop with interrupts disabled, and because of the panic,
> >>> it never succeeds. Whatever CPU is running "pr/ttyAMA0" is effectively
> >>> stuck at this point.
> >>>
> >>> 5. Meanwhile panic() continues, calling panic_other_cpus_shutdown(). On
> >>> ARM64, other CPUs are stopped by sending them an IPI. Each CPU receives
> >>> the IPI and calls the PSCI function to stop itself. But the CPU running
> >>> "pr/ttyAMA0" is looping forever with interrupts disabled, so it never
> >>> processes the IPI and it never stops. ARM64 doesn't have a true NMI that
> >>> can override the looping with interrupts disabled, so there's no way to
> >>> stop that CPU.
> >>>
> >>> 6. The failure to stop the "pr/ttyAMA0" CPU then causes downstream
> >>> problems, such as when loading and running a kdump kernel.
> >
> > [...]
> >
> >> After reproducing the issue,
> >> I plan to try a workaround that forcibly terminates the nbcon_reacquire_nobuf
> >> loop in pl011_console_write_thread if other_cpu_in_panic is true.
> >> Please comment if you have any other ideas.
> >
> > For panic, if it is OK to leave uap->clk enabled and not restore REG_CR,
> > then it should be fine to just return. But only for panic.
> >
> > So something like:
> >
> > 	while (!nbcon_enter_unsafe(wctxt)) {
> > 		if (other_cpu_in_panic())
> > 			return;
> > 		nbcon_reacquire_nobuf(wctxt);
> > 	}
> 
> Actually this is not enough because there is also a loop inside
> nbcon_reacquire_nobuf().
> 
> nbcon_reacquire_nobuf() needs to return an error for the panic case
> because it will never succeed. This is the only case where it will never
> succeed. Should we use a bool? Or return some code like -EPERM?
> 
> So the above code becomes:
> 
>  	while (!nbcon_enter_unsafe(wctxt)) {
>  		if (!nbcon_reacquire_nobuf(wctxt))
>  			return;
>  	}
> 
> We should also add __must_check to the prototype.
> 
> Thoughts?
> 
> John

Thanks for the suggestion, John.

This is a proposed fix to force termination by returning false from
nbcon_reacquire_nobuf when a panic occurs within pl011_console_write_thread.
 (I believe this is similar to what John suggested in his previous reply.)

While I couldn't reproduce the issue using sysrq-trigger in my environment
 (It seemed that the panic was being executed before the thread processing),
I did observe nbcon_reacquire_nobuf failing to complete when injecting an
NMI (SError) during pl011_console_write_thread. 
Applying this fix seems to have resolved the "SMP: failed to stop secondary
CPUs" issue.

This patch is for test.
Modifications to imx and other drivers, as well as adding __must_check, 
will likely be required.

Michael, could you please test this fix in your environment?

Regards,
Toshiyuki Sato

diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 11d650975..c3a2b22e6 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -2577,8 +2577,10 @@ pl011_console_write_thread(struct console *co, struct nbcon_write_context *wctxt
 		}
 	}
 
-	while (!nbcon_enter_unsafe(wctxt))
-		nbcon_reacquire_nobuf(wctxt);
+	while (!nbcon_enter_unsafe(wctxt)) {
+		if (!nbcon_reacquire_nobuf(wctxt))
+			return;
+	}
 
 	while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
 		cpu_relax();
diff --git a/include/linux/console.h b/include/linux/console.h
index 8f10d0a85..ae278b875 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -604,14 +604,14 @@ extern void nbcon_cpu_emergency_exit(void);
 extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
 extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
 extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
-extern void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
+extern bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
 #else
 static inline void nbcon_cpu_emergency_enter(void) { }
 static inline void nbcon_cpu_emergency_exit(void) { }
 static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
 static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
 static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
-static inline void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
+static inline bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
 #endif
 
 extern int console_set_on_cmdline;
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index fd12efcc4..ec016bb24 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -909,14 +909,18 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
  * output buffer because that has been lost. This function cannot be used to
  * resume printing.
  */
-void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
+bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
 {
 	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
 
-	while (!nbcon_context_try_acquire(ctxt))
+	while (!nbcon_context_try_acquire(ctxt)) {
 		cpu_relax();
+		if (other_cpu_in_panic())
+			return false;
+	}
 
 	nbcon_write_context_set_buf(wctxt, NULL, 0);
+	return true;
 }
 EXPORT_SYMBOL_GPL(nbcon_reacquire_nobuf);