linux-kernel - RE: [PATCH printk v1] printk: nbcon: Allow reacquire during panic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <SN6PR02MB4157602D9A334023667C5DE9D46AA@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Tue, 10 Jun 2025 00:10:33 +0000
From: Michael Kelley <mhklinux@...look.com>
To: John Ogness <john.ogness@...utronix.de>, Petr Mladek <pmladek@...e.com>
CC: Sergey Senozhatsky <senozhatsky@...omium.org>, Steven Rostedt
	<rostedt@...dmis.org>, Toshiyuki Sato <fj6611ie@...itsu.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH printk v1] printk: nbcon: Allow reacquire during panic

From: John Ogness <john.ogness@...utronix.de> Sent: Friday, June 6, 2025 11:56 AM
> 
> If a console printer is interrupted during panic, it will never
> be able to reacquire ownership in order to perform and cleanup.
> That in itself is not a problem, since the non-panic CPU will
> simply quiesce in an endless loop within nbcon_reacquire_nobuf().
> 
> However, in this state, platforms that do not support a true NMI
> to interrupt the quiesced CPU will not be able to shutdown that
> CPU from within panic(). This then causes problems for such as
> being unable to load and run a kdump kernel.
> 
> Fix this by allowing non-panic CPUs to reacquire ownership using
> a direct acquire. Then the non-panic CPUs can successfullyl exit

s/successfullyl/successfully/

> the nbcon_reacquire_nobuf() loop and the console driver can
> perform any necessary cleanup. But more importantly, the CPU is
> no longer quiesced and is free to process any interrupts
> necessary for panic() to shutdown the CPU.
> 
> All other forms of acquire are still not allowed for non-panic
> CPUs since it is safer to have them avoid gaining console
> ownership that is not strictly necessary.
> 
> Reported-by: Michael Kelley <mhklinux@...look.com>
> Closes: https://lore.kernel.org/all/SN6PR02MB4157A4C5E8CB219A75263A17D46DA@SN6PR02MB4157.namprd02.prod.outlook.com/
> Signed-off-by: John Ogness <john.ogness@...utronix.de>
> ---
>  kernel/printk/nbcon.c | 63 ++++++++++++++++++++++++++++---------------
>  1 file changed, 41 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index fd12efcc4aed..e7a3af81b173 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -214,8 +214,9 @@ static void nbcon_seq_try_update(struct nbcon_context *ctxt, u64 new_seq)
> 
>  /**
>   * nbcon_context_try_acquire_direct - Try to acquire directly
> - * @ctxt:	The context of the caller
> - * @cur:	The current console state
> + * @ctxt:		The context of the caller
> + * @cur:		The current console state
> + * @is_reacquire:	This acquire is a reacquire
>   *
>   * Acquire the console when it is released. Also acquire the console when
>   * the current owner has a lower priority and the console is in a safe state.
> @@ -225,17 +226,17 @@ static void nbcon_seq_try_update(struct nbcon_context *ctxt, u64 new_seq)
>   *
>   * Errors:
>   *
> - *	-EPERM:		A panic is in progress and this is not the panic CPU.
> - *			Or the current owner or waiter has the same or higher
> - *			priority. No acquire method can be successful in
> - *			this case.
> + *	-EPERM:		A panic is in progress and this is neither the panic
> + *			CPU nor is this a reacquire. Or the current owner or
> + *			waiter has the same or higher priority. No acquire
> + *			method can be successful in these cases.
>   *
>   *	-EBUSY:		The current owner has a lower priority but the console
>   *			in an unsafe state. The caller should try using
>   *			the handover acquire method.
>   */
>  static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
> -					    struct nbcon_state *cur)
> +					    struct nbcon_state *cur, bool is_reacquire)
>  {
>  	unsigned int cpu = smp_processor_id();
>  	struct console *con = ctxt->console;
> @@ -243,14 +244,20 @@ static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
> 
>  	do {
>  		/*
> -		 * Panic does not imply that the console is owned. However, it
> -		 * is critical that non-panic CPUs during panic are unable to
> -		 * acquire ownership in order to satisfy the assumptions of
> -		 * nbcon_waiter_matches(). In particular, the assumption that
> -		 * lower priorities are ignored during panic.
> +		 * Panic does not imply that the console is owned. However,
> +		 * since all non-panic CPUs are stopped during panic(), it
> +		 * is safer to have them avoid gaining console ownership.
> +		 *
> +		 * If this acquire is a reacquire (and an unsafe takeover
> +		 * has not previously occurred) then it is allowed to attempt
> +		 * a direct acquire in panic. This gives console drivers an
> +		 * opportunity to perform any necessary cleanup if they were
> +		 * interrupted by the panic CPU while printing.
>  		 */
> -		if (other_cpu_in_panic())
> +		if (other_cpu_in_panic() &&
> +		    (!is_reacquire || cur->unsafe_takeover)) {
>  			return -EPERM;
> +		}
> 
>  		if (ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio)
>  			return -EPERM;
> @@ -301,8 +308,9 @@ static bool nbcon_waiter_matches(struct nbcon_state *cur, int expected_prio)
>  	 * Event #1 implies this context is EMERGENCY.
>  	 * Event #2 implies the new context is PANIC.
>  	 * Event #3 occurs when panic() has flushed the console.
> -	 * Events #4 and #5 are not possible due to the other_cpu_in_panic()
> -	 * check in nbcon_context_try_acquire_direct().
> +	 * Event #4 occurs when a non-panic CPU reacquires.
> +	 * Event #5 is not possible due to the other_cpu_in_panic() check
> +	 *          in nbcon_context_try_acquire_handover().
>  	 */
> 
>  	return (cur->req_prio == expected_prio);
> @@ -431,6 +439,16 @@ static int nbcon_context_try_acquire_handover(struct nbcon_context *ctxt,
>  	WARN_ON_ONCE(ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio);
>  	WARN_ON_ONCE(!cur->unsafe);
> 
> +	/*
> +	 * Panic does not imply that the console is owned. However, it
> +	 * is critical that non-panic CPUs during panic are unable to
> +	 * wait for a handover in order to satisfy the assumptions of
> +	 * nbcon_waiter_matches(). In particular, the assumption that
> +	 * lower priorities are ignored during panic.
> +	 */
> +	if (other_cpu_in_panic())
> +		return -EPERM;
> +
>  	/* Handover is not possible on the same CPU. */
>  	if (cur->cpu == cpu)
>  		return -EBUSY;
> @@ -558,7 +576,8 @@ static struct printk_buffers panic_nbcon_pbufs;
> 
>  /**
>   * nbcon_context_try_acquire - Try to acquire nbcon console
> - * @ctxt:	The context of the caller
> + * @ctxt:		The context of the caller
> + * @is_reacquire:	This acquire is a reacquire
>   *
>   * Context:	Under @ctxt->con->device_lock() or local_irq_save().
>   * Return:	True if the console was acquired. False otherwise.
> @@ -568,7 +587,7 @@ static struct printk_buffers panic_nbcon_pbufs;
>   * in an unsafe state. Otherwise, on success the caller may assume
>   * the console is not in an unsafe state.
>   */
> -static bool nbcon_context_try_acquire(struct nbcon_context *ctxt)
> +static bool nbcon_context_try_acquire(struct nbcon_context *ctxt, bool is_reacquire)
>  {
>  	unsigned int cpu = smp_processor_id();
>  	struct console *con = ctxt->console;
> @@ -577,7 +596,7 @@ static bool nbcon_context_try_acquire(struct nbcon_context *ctxt)
> 
>  	nbcon_state_read(con, &cur);
>  try_again:
> -	err = nbcon_context_try_acquire_direct(ctxt, &cur);
> +	err = nbcon_context_try_acquire_direct(ctxt, &cur, is_reacquire);
>  	if (err != -EBUSY)
>  		goto out;
> 
> @@ -913,7 +932,7 @@ void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
>  {
>  	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
> 
> -	while (!nbcon_context_try_acquire(ctxt))
> +	while (!nbcon_context_try_acquire(ctxt, true))
>  		cpu_relax();
> 
>  	nbcon_write_context_set_buf(wctxt, NULL, 0);
> @@ -1101,7 +1120,7 @@ static bool nbcon_emit_one(struct nbcon_write_context *wctxt, bool use_atomic)
>  		cant_migrate();
>  	}
> 
> -	if (!nbcon_context_try_acquire(ctxt))
> +	if (!nbcon_context_try_acquire(ctxt, false))
>  		goto out;
> 
>  	/*
> @@ -1486,7 +1505,7 @@ static int __nbcon_atomic_flush_pending_con(struct console *con, u64 stop_seq,
>  	ctxt->prio			= nbcon_get_default_prio();
>  	ctxt->allow_unsafe_takeover	= allow_unsafe_takeover;
> 
> -	if (!nbcon_context_try_acquire(ctxt))
> +	if (!nbcon_context_try_acquire(ctxt, false))
>  		return -EPERM;
> 
>  	while (nbcon_seq_read(con) < stop_seq) {
> @@ -1762,7 +1781,7 @@ bool nbcon_device_try_acquire(struct console *con)
>  	ctxt->console	= con;
>  	ctxt->prio	= NBCON_PRIO_NORMAL;
> 
> -	if (!nbcon_context_try_acquire(ctxt))
> +	if (!nbcon_context_try_acquire(ctxt, false))
>  		return false;
> 
>  	if (!nbcon_context_enter_unsafe(ctxt))
> 

Tested this patch in a Dpv6-series ARM64 VM in the Azure cloud, and did not
see any occurrences of failing to stop a CPU like I originally reported. Custom
logging shows that the original problem scenario is still happening, and the
patch allows recovery. So the good result is not just due to a timing change
that avoids the original problem.

The looping behavior of the "pr/ttyAMA0" task in nbcon_kthread_func() is
the same as it was with Toshiyuki's original test patch.

Tested-by: Michael Kelley <mhklinux@...look.com>