linux-kernel - RE: Problem with nbcon console and amba-pl011 serial port

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <SN6PR02MB41575419E1223B8A8B0A1F1FD46FA@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 5 Jun 2025 02:49:22 +0000
From: Michael Kelley <mhklinux@...look.com>
To: "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com>, 'John Ogness'
	<john.ogness@...utronix.de>
CC: "pmladek@...e.com" <pmladek@...e.com>, 'Ryo Takakura'
	<ryotkkr98@...il.com>, Russell King <linux@...linux.org.uk>, Greg
 Kroah-Hartman <gregkh@...uxfoundation.org>, Jiri Slaby
	<jirislaby@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-serial@...r.kernel.org"
	<linux-serial@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>
Subject: RE: Problem with nbcon console and amba-pl011 serial port

From: Toshiyuki Sato (Fujitsu) <fj6611ie@...itsu.com> Sent: Tuesday, June 3, 2025 9:11 PM
> 
> Hi Michael, John,
> 

[snip]

> 
> This is a proposed fix to force termination by returning false from
> nbcon_reacquire_nobuf when a panic occurs within pl011_console_write_thread.
>  (I believe this is similar to what John suggested in his previous reply.)
> 
> While I couldn't reproduce the issue using sysrq-trigger in my environment
>  (It seemed that the panic was being executed before the thread processing),
> I did observe nbcon_reacquire_nobuf failing to complete when injecting an
> NMI (SError) during pl011_console_write_thread.
> Applying this fix seems to have resolved the "SMP: failed to stop secondary
> CPUs" issue.
> 
> This patch is for test.
> Modifications to imx and other drivers, as well as adding __must_check,
> will likely be required.
> 
> Michael, could you please test this fix in your environment?

I've tested the fix in my primary environment (ARM64 VM in the Azure cloud),
and I've seen no failures to stop a CPU. I kept my custom logging in place, so I
could confirm that the problem path is still happening, and the fix recovers from
the problem path. So the good results are not due to just a timing change. The
"pr/ttyAMA0" task is still looping forever trying to get ownership of the console,
but it is doing so at a higher level in nbcon_kthread_func() and in calling
nbcon_emit_one(), and interrupts are enabled for part of the loop.

Full disclosure: I have a secondary environment, also an ARM64 VM in the
Azure cloud, but running on an older version of Hyper-V. In this environment
I see the same custom logging results, and the "pr/ttyAMA0" task is indeed
looping with interrupts enabled. But for some reason, the CPU doesn't stop
in response to IPI_CPU_STOP. I don't see any evidence that this failure to stop
is due to the Linux pl011 driver or nbcon. This older version of Hyper-V has a
known problem in pl011 UART emulation, and I have a theory on how that
problem may be causing the failure to stop. It will take me some time to
investigate further, but based on what I know now, that investigation should
not hold up this fix.

Michael

> 
> Regards,
> Toshiyuki Sato
> 
> diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
> index 11d650975..c3a2b22e6 100644
> --- a/drivers/tty/serial/amba-pl011.c
> +++ b/drivers/tty/serial/amba-pl011.c
> @@ -2577,8 +2577,10 @@ pl011_console_write_thread(struct console *co, struct nbcon_write_context *wctxt
>  		}
>  	}
> 
> -	while (!nbcon_enter_unsafe(wctxt))
> -		nbcon_reacquire_nobuf(wctxt);
> +	while (!nbcon_enter_unsafe(wctxt)) {
> +		if (!nbcon_reacquire_nobuf(wctxt))
> +			return;
> +	}
> 
>  	while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
>  		cpu_relax();
> diff --git a/include/linux/console.h b/include/linux/console.h
> index 8f10d0a85..ae278b875 100644
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -604,14 +604,14 @@ extern void nbcon_cpu_emergency_exit(void);
>  extern bool nbcon_can_proceed(struct nbcon_write_context *wctxt);
>  extern bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt);
>  extern bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt);
> -extern void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
> +extern bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt);
>  #else
>  static inline void nbcon_cpu_emergency_enter(void) { }
>  static inline void nbcon_cpu_emergency_exit(void) { }
>  static inline bool nbcon_can_proceed(struct nbcon_write_context *wctxt) { return false; }
>  static inline bool nbcon_enter_unsafe(struct nbcon_write_context *wctxt) { return false; }
>  static inline bool nbcon_exit_unsafe(struct nbcon_write_context *wctxt) { return false; }
> -static inline void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
> +static inline bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
>  #endif
> 
>  extern int console_set_on_cmdline;
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index fd12efcc4..ec016bb24 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -909,14 +909,18 @@ EXPORT_SYMBOL_GPL(nbcon_exit_unsafe);
>   * output buffer because that has been lost. This function cannot be used to
>   * resume printing.
>   */
> -void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
> +bool nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
>  {
>  	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
> 
> -	while (!nbcon_context_try_acquire(ctxt))
> +	while (!nbcon_context_try_acquire(ctxt)) {
>  		cpu_relax();
> +		if (other_cpu_in_panic())
> +			return false;
> +	}
> 
>  	nbcon_write_context_set_buf(wctxt, NULL, 0);
> +	return true;
>  }
>  EXPORT_SYMBOL_GPL(nbcon_reacquire_nobuf);