linux-kernel - Re: [PATCH v2 3/3] kdb: Adapt kdb_msg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b32657500ac97857115bd93acd94595172efcd8d.camel@suse.com>
Date: Fri, 29 Aug 2025 11:01:02 -0300
From: Marcos Paulo de Souza <mpdesouza@...e.com>
To: Petr Mladek <pmladek@...e.com>, John Ogness <john.ogness@...utronix.de>
Cc: Daniel Thompson <daniel@...cstar.com>, Greg Kroah-Hartman	
 <gregkh@...uxfoundation.org>, Steven Rostedt <rostedt@...dmis.org>, Sergey
 Senozhatsky <senozhatsky@...omium.org>, Jason Wessel
 <jason.wessel@...driver.com>, Daniel Thompson	 <danielt@...nel.org>,
 Douglas Anderson <dianders@...omium.org>, 	linux-kernel@...r.kernel.org,
 kgdb-bugreport@...ts.sourceforge.net
Subject: Re: [PATCH v2 3/3] kdb: Adapt kdb_msg_write to work with NBCON
 consoles

On Fri, 2025-08-29 at 15:15 +0200, Petr Mladek wrote:
> On Tue 2025-08-26 17:17:32, John Ogness wrote:
> > On 2025-08-11, Marcos Paulo de Souza <mpdesouza@...e.com> wrote:
> > > On Mon, 2025-08-11 at 15:38 +0100, Daniel Thompson wrote:
> > > > On Mon, Aug 11, 2025 at 10:32:47AM -0300, Marcos Paulo de Souza
> > > > wrote:
> > > > > Function kdb_msg_write was calling con->write for any found
> > > > > console, but it won't work on NBCON ones. In this case we
> > > > > should
> > > > > acquire the ownership of the console using
> > > > > NBCON_PRIO_EMERGENCY,
> > > > > since printing kdb messages should only be interrupted by a
> > > > > panic. This is done by the nbcon_kdb_{acquire,release}
> > > > > functions.
> > > > 
> > > > Just wanted to check what it means to be "interrupted by a
> > > > panic"?
> > > > 
> > > > kdb is called from the panic handler but, assuming the serial
> > > > port is
> > > > run syncrhonously when "bad things are happening", the serial
> > > > port
> > > > should be quiet when we enter kdb meaning we can still acquire
> > > > ownership of the console?
> > > 
> > > TBH I haven't thought about that. I talked with Petr Mladek about
> > > it,
> > > and we agreed to have something similar to
> > > nbcon_device_try_acquire,
> > > but with a higher priority, to make sure that we would get the
> > > context
> > > at this point. But, when thinking about it, since KDB runs on the
> > > panic
> > > handler, so we I'm not sure even if we need the _enter_unsafe()
> > > call at
> > > this point, since nobody is going to interrupt either way.
> > 
> > Well, kdb can also run when not in panic. In that case, if a panic
> > occurs while using kdb, those panic messages should be printed
> > directly
> > on the consoles.
> > 
> > Also be aware that the kdb interface is using dbg_io_ops-
> > >write_char()
> > for printing and it will ignore a console that matches
> > dbg_io_ops->cons. So there is no concern about the kdb interface
> > conflicting with the nbcon console. This is just about the mirrored
> > kdb
> > output.
> 
> I was a bit confused by the "mirrored kdb output". It was a new term
> ;-)
> Let me try to summarize how I see it. Take it with a caution because
> I
> am not much familiar with the kdb code.
> 
> vkdb_printf() API
> -----------------
> 
> It serializes calls using a custom recursive CPU lock
> (kdb_printf_cpu)
> 
> It shows/stores the messages using several interfaces:
> 
>      a) gdbstub_msg_write() probably shows the message inside
> 	an attached gdb debugger.
> 
>      b) kdb_msg_write() shows the message on the console where kdb
> 	is attached via dbg_io_ops->write_char()
> 
>      c) kdb_msg_write() also writes the message on all other consoles
> 	registered by printk. I guess that this is what John meant
> 	by mirroring.
> 
>      d) printk()/pr_info() stores the messages into printk log buffer
> 	and eventually shows them on printk consoles. It is called
> 	only when @logging is enabled
> 
> Note that either gdbstub_msg_write() or kdb_msg_write() is called.
> Also I guess that @logging can be enabled only when gdb debugger is
> attached. kdbgetintenv() most likely returns KDB_NOTENV when gdb is
> not attached.
> 
>   => vkdb_printf() shows the message on only once on consoles:
> 
>       + via kdb_msg_write() when gdb is not attached
> 
>       + via printk() when gdb is attached and logging is enabled
> 	by setting LOGGING= environment variable
> 
> 
> vkdb_printf() context
> ---------------------
> 
> vkdb_printf() is called when entering or being inside kdb code.
> I guess that we might end there via:
> 
>    + int3 (break point)
>    + kgdb_panic() called in panic()
>    + NMI ???
> 
> I think that it might be called even before kdb synchronizes
> all CPUs into some quiescent mode. Otherwise, the custom CPU
> synchronization would not be needed.
> 
> For example, I guess that the CPUs are not synchronized here:
> 
> void kgdb_panic(const char *msg)
> {
> [...]
> 	if (dbg_kdb_mode)
> 		kdb_printf("PANIC: %s\n", msg);
> 
> 
> But I might be wrong. Also it seems that kgdb_cpu_enter() makes sure
> that all CPUs get synchronized before entering the main loop...
> 
> 
> Serialization of vkdb_printf() vs. printk()
> ===========================================
> 
> Let's look at the various interfaces showing the messages:
> 
>     a) gdbstub_msg_write() does not conflict with printk()
>        at all.
> 
>     b) It seems that kdb_msg_write() -> dbg_io_ops->write_char()
>        uses some special API which tries to check whether the device
>        is in a safe state and eventually waits for the safe state
> (pooling).
> 
>        It seems to be a one way synchronization. It "guarantees" that
>        kdb would start write only when safe. But it does not block
>        the other side.
> 
>        It might be safe when the other side is blocked which likely
>        happens in kgdb_cpu_enter().
> 
> 
>      c) kdb_msg_write() -> con->write()/con->write_atomic(), where
> 
> 	  + con->write() is the legacy console callback. It is
> 	    internally synchronized via some lock, e.g. port->lock.
> 
> 	    But kdb_msg_write() increments @oops_in_progress so that
> 	    the internal lock is basically ignored.
> 
> 	      => It looks like a try-and-hope approach used also by
> panic().
> 
> 
> 	  + con->write_atomic() is nbcon console callback. It is
> going
> 	    to be synchronized via the new nbcon_kdb_try_acquire()
> 
> 	       => The output won't be guaranteed because
> try_acquire()
> 		  might fail. But it should be safe.
> 
> 
>       d) printk()/pr_info() is synchronized against other printk()
> out
> 	 of box. It would show the messages on the consoles only
> when
> 	 safe.
> 
> 	 BTW: It might make sense to call printk()/pr_info() inside
> 	      nbcon_cpu_emergency_enter()/exit() section so that it
> 	      would try to flush the messages immediately when safe.

Thanks for the insights!

> 
> > > About the try_acquire part, I haven't checked about the state of
> > > the
> > > console devices when the panic happens, if they drop the
> > > ownership of
> > > the console on non-panic CPUs or not, so I'm not sure if this is
> > > needed
> > > or not. I'll wait for Petr and/or maybe John to add some info.
> > 
> > If any context owned the console and is in an unsafe section while
> > being
> > interrupted by kdb (regardless if panic or not), then
> > nbcon_kdb_try_acquire() will fail and the mirrored kdb output will
> > not
> > be visible on that console.
> > 
> > I am not sure how important it is that the output is mirrored in
> > this
> > case. A hostile takeover is not acceptable. And implementing some
> > sort
> > of delay so that the current owner has a chance to release the
> > ownership
> > (similar to what was attempted here [0]) is not only complicated,
> > but
> > has its own set of problems.
> 
> The solution proposed in this patch (nbcon_kdb_try_acquire()) looks
> acceptable to me. The output is not guaranteed. But is should
> hopefully work in most situations.
> 
> The great thing is that it would be safe in compare with the legacy
> consoles where @oops_in_progress causes ignoring the internal
> locking.

great!

> 
> 
> > Currently there is no mirrored output for nbcon consoles. With this
> > series it is at least possible.
> > 
> > BTW, in order for CPU switching during panic to be able to mirror
> > output
> > on nbcon consoles, an additional exception must be added to
> > nbcon_context_try_acquire_direct(). It would look like this:
> 
> Great idea. I am just not sure about the condition, see below.
> 
> > diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> > index 79d8c74378061..2c168eaf378ed 100644
> > --- a/kernel/printk/nbcon.c
> > +++ b/kernel/printk/nbcon.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/export.h>
> >  #include <linux/init.h>
> >  #include <linux/irqflags.h>
> > +#include <linux/kgdb.h>
> >  #include <linux/kthread.h>
> >  #include <linux/minmax.h>
> >  #include <linux/percpu.h>
> > @@ -247,6 +248,8 @@ static int
> > nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
> >  		 * Panic does not imply that the console is owned.
> > However,
> >  		 * since all non-panic CPUs are stopped during
> > panic(), it
> >  		 * is safer to have them avoid gaining console
> > ownership.
> > +		 * The only exception is if kgdb is active, which
> > may print
> > +		 * from multiple CPUs during a panic.
> >  		 *
> >  		 * If this acquire is a reacquire (and an unsafe
> > takeover
> >  		 * has not previously occurred) then it is allowed
> > to attempt
> > @@ -255,6 +258,7 @@ static int
> > nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
> >  		 * interrupted by the panic CPU while printing.
> >  		 */
> >  		if (other_cpu_in_panic() &&
> > +		    atomic_read(&kgdb_active) == -1 &&
> 
> This would likely work for most kgdb_printk() calls. But what about
> the one called from kgdb_panic()?
> 
> Alternative solution would be to allow it only for the CPU locked
> by kdb, something like:
> 
> 		    READ_ONCE(kdb_printf_cpu) !=
> raw_smp_processor_id() &&
> 
> Note that I used READ_ONCE() to guarantee an atomic read. The
> condition will fail only when we are inside a code locked by
> the kdb_printf_cpu().
> 
> An alternative would be smp_load_acquire(&kdb_printf_cpu). But
> I think that it is a "too big" hammer and it does not fit here.

Do you guys plan to send a separated patch for it later (or before) 
this patchset?

> 
> > +		    atomic_read(&kgdb_active) == -1 &&
> >  		    (!is_reacquire || cur->unsafe_takeover)) {
> >  			return -EPERM;
> >  		}
> > 
> > > > > @@ -605,7 +616,14 @@ static void kdb_msg_write(const char
> > > > > *msg, int msg_len)
> > > > > 		 * for this calling context.
> > > > > 		 */
> > > > > 		++oops_in_progress;
> > > > > -		c->write(c, msg, msg_len);
> > > > > +		if (flags & CON_NBCON) {
> > > > > +			wctxt.outbuf = (char *)msg;
> > > > > +			wctxt.len = msg_len;
> > > > > +			c->write_atomic(c, &wctxt);
> > > > > +			nbcon_kdb_release(&wctxt);
> > > > > +		} else {
> > > > > +			c->write(c, msg, msg_len);
> > > > > +		}
> > > > > 		--oops_in_progress;
> > > > > 		touch_nmi_watchdog();
> > > > > 	}
> > > > 
> > > > Dumb question, but is the oops_in_progress bump still useful
> > > > with
> > > > atomic consoles? Will the console have any spinlocks to
> > > > disregard or
> > > > will the console ownership code already handled any mutual
> > > > exclusion
> > > > issues meaning there should be no spinlocks taking by the
> > > > atomic
> > > > write handler?
> > > 
> > > IIUC, since we can have multiple consoles, and some of them are
> > > NB and
> > > others aren't, I believe that this oops_in_progress is still
> > > useful.
> > 
> > Correct, but only for legacy consoles. Please move the
> > @oops_in_progress
> > increment/decrement to only be around the c->write() call. This
> > makes it
> > clear that the hack is only "useful" for the legacy consoles.
> 
> I agree.

Great, this simplifies the logic by only checking the flags for NBCON
only once.

> 
> > John
> > 
> > [0]
> > https://lore.kernel.org/lkml/20210803131301.5588-4-john.ogness@linutronix.de
> 
> Sigh, I have already forgotten that we discussed this in the past.
> 
> Best Regards,
> Petr