[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<SN6PR02MB41575419E1223B8A8B0A1F1FD46FA@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 5 Jun 2025 02:49:22 +0000
From: Michael Kelley <mhklinux@...look.com>
To: "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com>, 'John Ogness'
<john.ogness@...utronix.de>
CC: "pmladek@...e.com" <pmladek@...e.com>, 'Ryo Takakura'
<ryotkkr98@...il.com>, Russell King <linux@...linux.org.uk>, Greg
Kroah-Hartman <gregkh@...uxfoundation.org>, Jiri Slaby
<jirislaby@...nel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-serial@...r.kernel.org"
<linux-serial@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>
Subject: RE: Problem with nbcon console and amba-pl011 serial port
From: Toshiyuki Sato (Fujitsu) <fj6611ie@...itsu.com> Sent: Tuesday, June 3, 2025 9:11 PM
>
> Hi Michael, John,
>
[snip]
>
> This is a proposed fix to force termination by returning false from
> nbcon_reacquire_nobuf when a panic occurs within pl011_console_write_thread.
> (I believe this is similar to what John suggested in his previous reply.)
>
> While I couldn't reproduce the issue using sysrq-trigger in my environment
> (It seemed that the panic was being executed before the thread processing),
> I did observe nbcon_reacquire_nobuf failing to complete when injecting an
> NMI (SError) during pl011_console_write_thread.
> Applying this fix seems to have resolved the "SMP: failed to stop secondary
> CPUs" issue.
>
> This patch is for test.
> Modifications to imx and other drivers, as well as adding __must_check,
> will likely be required.
>
> Michael, could you please test this fix in your environment?
I've tested the fix in my primary environment (ARM64 VM in the Azure cloud),
and I've seen no failures to stop a CPU. I kept my custom logging in place, so I
could confirm that the problem path is still happening, and the fix recovers from
the problem path. So the good results are not due to just a timing change. The
"pr/ttyAMA0" task is still looping forever trying to get ownership of the console,
but it is doing so at a higher level in nbcon_kthread_func() and in calling
nbcon_emit_one(), and interrupts are enabled for part of the loop.
Full disclosure: I have a secondary environment, also an ARM64 VM in the
Azure cloud, but running on an older version of Hyper-V. In this environment
I see the same custom logging results, and the "pr/ttyAMA0" task is indeed
looping with interrupts enabled. But for some reason, the CPU doesn't stop
in response to IPI_CPU_STOP. I don't see any evidence that this failure to stop
is due to the Linux pl011 driver or nbcon. This older version of Hyper-V has a
known problem in pl011 UART emulation, and I have a theory on how that
problem may be causing the failure to stop. It will take me some time to
investigate further, but based on what I know now, that investigation should
not hold up this fix.
Michael
>
> Regards,
> Toshiyuki Sato
>
> diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
> index 11d650975..c3a2b22e6 100644
> --- a/drivers/tty/serial/amba-pl011.c
> +++ b/drivers/tty/serial/amba-pl011.c
> @@ -2577,8 +2577,10 @@ pl011_console_write_thread(struct console *co, struct nbcon_write_context *wctxt
> }
> }
>
> - while (!nbcon_enter_unsafe(wctxt))
> - nbcon_reacquire_nobuf(wctxt);
> + while (!nbcon_enter_unsafe(wctxt)) {
> + if (!nbcon_reacquire_nobuf(wctxt))
> + return;
> + }
>
> while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
> cpu_relax();
> diff --git a/include/linux/console.h b/include/linux/console.h
> index 8f10d0a85..ae278b875 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -604,14 +604,14 @@ extern void nbcon_cpu_emergency_exit(void);
> extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
> extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
> extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
> -extern void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
> +extern bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
> #else
> static inline void nbcon_cpu_emergency_enter(void) { }
> static inline void nbcon_cpu_emergency_exit(void) { }
> static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
> static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
> static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
> -static inline void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
> +static inline bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
> #endif
>
> extern int console_set_on_cmdline;
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index fd12efcc4..ec016bb24 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -909,14 +909,18 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
> * output buffer because that has been lost. This function cannot be used to
> * resume printing.
> */
> -void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
> +bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
> {
> struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
>
> - while (!nbcon_context_try_acquire(ctxt))
> + while (!nbcon_context_try_acquire(ctxt)) {
> cpu_relax();
> + if (other_cpu_in_panic())
> + return false;
> + }
>
> nbcon_write_context_set_buf(wctxt, NULL, 0);
> + return true;
> }
> EXPORT_SYMBOL_GPL(nbcon_reacquire_nobuf);
Powered by blists - more mailing lists