[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<TY4PR01MB13777674C22721FCD8ACF4FCCD76CA@TY4PR01MB13777.jpnprd01.prod.outlook.com>
Date: Wed, 4 Jun 2025 04:11:10 +0000
From: "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com>
To: 'Michael Kelley' <mhklinux@...look.com>, 'John Ogness'
<john.ogness@...utronix.de>
CC: "pmladek@...e.com" <pmladek@...e.com>, 'Ryo Takakura'
<ryotkkr98@...il.com>, Russell King <linux@...linux.org.uk>, Greg
Kroah-Hartman <gregkh@...uxfoundation.org>, Jiri Slaby
<jirislaby@...nel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-serial@...r.kernel.org"
<linux-serial@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "Toshiyuki Sato (Fujitsu)"
<fj6611ie@...itsu.com>
Subject: RE: Problem with nbcon console and amba-pl011 serial port
Hi Michael, John,
> Hi Petr,
>
> On 2025-06-03, John Ogness <john.ogness@...utronix.de> wrote:
> > On 2025-06-03, "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com> wrote:
> >>> 4. pr_emerg() has a high logging level, and it effectively steals the console
> >>> from the "pr/ttyAMA0" task, which I believe is intentional in the nbcon
> design.
> >>> Down in pl011_console_write_thread(), the "pr/ttyAMA0" task is doing
> >>> nbcon_enter_unsafe() and nbcon_exit_unsafe() around each character
> >>> that it outputs. When pr_emerg() steals the console, nbcon_exit_unsafe()
> >>> returns 0, so the "for" loop exits. pl011_console_write_thread() then
> >>> enters a busy "while" loop waiting to reclaim the console. It's doing this
> >>> busy "while" loop with interrupts disabled, and because of the panic,
> >>> it never succeeds. Whatever CPU is running "pr/ttyAMA0" is effectively
> >>> stuck at this point.
> >>>
> >>> 5. Meanwhile panic() continues, calling panic_other_cpus_shutdown(). On
> >>> ARM64, other CPUs are stopped by sending them an IPI. Each CPU receives
> >>> the IPI and calls the PSCI function to stop itself. But the CPU running
> >>> "pr/ttyAMA0" is looping forever with interrupts disabled, so it never
> >>> processes the IPI and it never stops. ARM64 doesn't have a true NMI that
> >>> can override the looping with interrupts disabled, so there's no way to
> >>> stop that CPU.
> >>>
> >>> 6. The failure to stop the "pr/ttyAMA0" CPU then causes downstream
> >>> problems, such as when loading and running a kdump kernel.
> >
> > [...]
> >
> >> After reproducing the issue,
> >> I plan to try a workaround that forcibly terminates the nbcon_reacquire_nobuf
> >> loop in pl011_console_write_thread if other_cpu_in_panic is true.
> >> Please comment if you have any other ideas.
> >
> > For panic, if it is OK to leave uap->clk enabled and not restore REG_CR,
> > then it should be fine to just return. But only for panic.
> >
> > So something like:
> >
> > while (!nbcon_enter_unsafe(wctxt)) {
> > if (other_cpu_in_panic())
> > return;
> > nbcon_reacquire_nobuf(wctxt);
> > }
>
> Actually this is not enough because there is also a loop inside
> nbcon_reacquire_nobuf().
>
> nbcon_reacquire_nobuf() needs to return an error for the panic case
> because it will never succeed. This is the only case where it will never
> succeed. Should we use a bool? Or return some code like -EPERM?
>
> So the above code becomes:
>
> while (!nbcon_enter_unsafe(wctxt)) {
> if (!nbcon_reacquire_nobuf(wctxt))
> return;
> }
>
> We should also add __must_check to the prototype.
>
> Thoughts?
>
> John
Thanks for the suggestion, John.
This is a proposed fix to force termination by returning false from
nbcon_reacquire_nobuf when a panic occurs within pl011_console_write_thread.
(I believe this is similar to what John suggested in his previous reply.)
While I couldn't reproduce the issue using sysrq-trigger in my environment
(It seemed that the panic was being executed before the thread processing),
I did observe nbcon_reacquire_nobuf failing to complete when injecting an
NMI (SError) during pl011_console_write_thread.
Applying this fix seems to have resolved the "SMP: failed to stop secondary
CPUs" issue.
This patch is for test.
Modifications to imx and other drivers, as well as adding __must_check,
will likely be required.
Michael, could you please test this fix in your environment?
Regards,
Toshiyuki Sato
diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 11d650975..c3a2b22e6 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -2577,8 +2577,10 @@ pl011_console_write_thread(struct console *co, struct nbcon_write_context *wctxt
}
}
- while (!nbcon_enter_unsafe(wctxt))
- nbcon_reacquire_nobuf(wctxt);
+ while (!nbcon_enter_unsafe(wctxt)) {
+ if (!nbcon_reacquire_nobuf(wctxt))
+ return;
+ }
while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
cpu_relax();
diff --git a/include/linux/console.h b/include/linux/console.h
index 8f10d0a85..ae278b875 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -604,14 +604,14 @@ extern void nbcon_cpu_emergency_exit(void);
extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
-extern void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
+extern bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
#else
static inline void nbcon_cpu_emergency_enter(void) { }
static inline void nbcon_cpu_emergency_exit(void) { }
static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
-static inline void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
+static inline bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
#endif
extern int console_set_on_cmdline;
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index fd12efcc4..ec016bb24 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -909,14 +909,18 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
* output buffer because that has been lost. This function cannot be used to
* resume printing.
*/
-void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
+bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
{
struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
- while (!nbcon_context_try_acquire(ctxt))
+ while (!nbcon_context_try_acquire(ctxt)) {
cpu_relax();
+ if (other_cpu_in_panic())
+ return false;
+ }
nbcon_write_context_set_buf(wctxt, NULL, 0);
+ return true;
}
EXPORT_SYMBOL_GPL(nbcon_reacquire_nobuf);
Powered by blists - more mailing lists