Message-ID: <aNQO-zl3k1l4ENfy@pathway.suse.cz>
Date: Wed, 24 Sep 2025 17:32:11 +0200
From: Petr Mladek <pmladek@...e.com>
To: John Ogness <john.ogness@...utronix.de>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Jiri Slaby <jirislaby@...nel.org>,
Sergey Senozhatsky <senozhatsky@...omium.org>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Esben Haabendal <esben@...nix.com>, linux-serial@...r.kernel.org,
linux-kernel@...r.kernel.org,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
Arnd Bergmann <arnd@...db.de>, Tony Lindgren <tony@...mide.com>,
Niklas Schnelle <schnelle@...ux.ibm.com>,
Serge Semin <fancer.lancer@...il.com>,
Andrew Murray <amurray@...goodpenguin.co.uk>
Subject: Re: [RFC 0/1] serial: 8250: nbcon_atomic_flush_pending() might
trigger watchdog warnings
Added Andrew Murray to Cc.

On Wed 2025-09-24 14:48:12, John Ogness wrote:
> On 2025-09-24, Petr Mladek <pmladek@...e.com> wrote:
> > I tried to implement some naive solution, see below. I think that
> > it can be described as the 2nd layer of ownership.
> >
> > It does not look that complicated. It might be because I used it only
> > in two paths which do the writing. And it is possible that it does
> > not work properly.
>
> It is an interesting approach and is one that I tend to prefer.
>
> > Then I got another idea. It might be even easier.
> >
> > I think that it might actually be enough to block the kthread when
> > any CPU is in emergency context, for example, by using a global
> > atomic counter.
>
> This is the quick idea that usually comes first. Basically introducing
> an emergency_in_progress() to keep the kthreads out. My problem with
> this is that it causes _all_ consoles to synchronize with each other,
> which we worked hard to get away from. Yes, I realize Linus rejected the
> "store the backtrace, then print it" implementation, which limits the
> effectiveness of parallel printing. Nevertheless, I would prefer to work
> towards returning to parallelization, rather than reducing it further.
Let me play the devil's advocate for a while.
IMHO, keeping the kthreads running in parallel for a synchronous
emergency report opens a can of worms.
Yes, the kthreads might be fast enough when the emergency context
is busy with a slow console. But they might also cause repeated
takeovers for "every" message when a kthread starts emitting
each new message just to lose the ownership after a few characters.
Honestly, blocking the kthreads during an emergency does not look
that bad to me.
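
Just to show what I mean by blocking them, here is a completely
untested sketch with invented names. The counter would be updated
from the emergency_enter/exit API and the printing kthreads would
back off while it is non-zero:

static atomic_t nbcon_emergency_cnt = ATOMIC_INIT(0);

static inline void nbcon_emergency_cnt_inc(void)
{
	atomic_inc(&nbcon_emergency_cnt);
}

static inline void nbcon_emergency_cnt_dec(void)
{
	atomic_dec(&nbcon_emergency_cnt);
}

/* Checked by the printing kthreads before they take the console context. */
static inline bool nbcon_emergency_in_progress(void)
{
	return atomic_read(&nbcon_emergency_cnt) > 0;
}

The kthreads would also need to get woken up, or re-check, once the
counter drops back to zero, but that is a detail.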
> > I am not sure if you have already started working on it. I would rather share
> > my naive ideas early so that I do not duplicate the effort.
> > It is possible that you have been there.
>
> Thanks. Yes, I tried various ideas like this. But your version does go
> about it differently. Although I am seeing the same problems. My
> comments below.
>
> > +int nbcon_context_put_flush_prio(struct nbcon_write_context *wctxt)
> > +{
> > +	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
> > +	struct console *con = ctxt->console;
> > +	int ret = 0;
> > +
> > +	if (!nbcon_context_try_acquire(ctxt, false))
> > +		return -EPERM;
>
> This is my main concern. If a context has set the flush_prio to
> something higher, it _MUST_ set it back down later. This cannot be best
> effort. A failed acquire does not mean that a context with the same
> priority is the owner (for example, it could be a NORMAL context that is
> in an unsafe section). Remember that there are owners that are not
> printers.
>
> So now we have a similar situation as nbcon_reacquire_nobuf(): we need
> to regain ownership so that we can undo something we setup. And that is
> the problem we are trying to solve in the first place. Maybe since this
> moves the problem from NORMAL to EMERGENCY, the busy-waiting is
> acceptable.
>
> As you mentioned, there is the problem that the flushing context could
> change hands multiple times before the flush_prio is restored. And there
> is also the recursive case where a WARN could happen within a WARN, or a
> WARN could happen and then in an NMI another WARN. All these cases
> probably mean that it needs to be a per-prio counter rather than simply
> a single prio so that each context can increment/decrement their prio.
Thanks a lot for showing all the problems.
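
Just to be sure we mean the same thing by a per-prio counter, I imagine
something like this (untested, invented names; shown as a global array,
but it could as well live in struct console):

static atomic_t flush_prio_cnt[NBCON_PRIO_MAX];

static void flush_prio_get(enum nbcon_prio prio)
{
	atomic_inc(&flush_prio_cnt[prio]);
}

static void flush_prio_put(enum nbcon_prio prio)
{
	atomic_dec(&flush_prio_cnt[prio]);
}

/* The effective priority is the highest level with a non-zero counter. */
static enum nbcon_prio flush_prio_effective(void)
{
	int prio;

	for (prio = NBCON_PRIO_MAX - 1; prio > NBCON_PRIO_NONE; prio--) {
		if (atomic_read(&flush_prio_cnt[prio]))
			return prio;
	}

	return NBCON_PRIO_NONE;
}

Each context would increment the counter for its priority when it raises
the flush priority and decrement it when it is done. Nested WARNs, or a
WARN inside an NMI, would then just see a higher count.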
It might be easier if we combine it with the first approach. I mean,
block only the console-specific kthread, from
__nbcon_atomic_flush_pending_con(). In other words, block
the kthread from __nbcon_atomic_flush_pending_con() instead of from
emergency_enter(), and store the counter in struct console.
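
A completely untested sketch of what I mean; the counter field in
struct console is invented (let's call it nbcon_prio_flush_cnt):

static void nbcon_prio_flush_get(struct console *con)
{
	atomic_inc(&con->nbcon_prio_flush_cnt);
}

static void nbcon_prio_flush_put(struct console *con)
{
	atomic_dec(&con->nbcon_prio_flush_cnt);
}

/* The per-console kthread would back off while this returns true. */
static bool nbcon_prio_flush_in_progress(struct console *con)
{
	return atomic_read(&con->nbcon_prio_flush_cnt) > 0;
}

__nbcon_atomic_flush_pending_con() would increment the counter before
flushing, release and re-acquire the console context between records,
and decrement the counter at the end. Only the kthread of the affected
console would be blocked.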
Summary:

We currently have the following solutions for the original
problem (hardlockup in nbcon_reacquire_nobuf()):

1. Touch the watchdog in nbcon_reacquire_nobuf().

   Pros:
     + trivial

   Cons:
     + Two CPUs might be blocked by slow serial consoles.

2. Yield the nbcon console context ownership after each record
   and block all kthreads via the emergency_enter/exit API.

   Pros:
     + Only one CPU is blocked by a slow serial console.
     + Prevents repeated takeovers for "every" new message.

   Cons:
     + More complex than 1.
     + Completely gives up on parallel console handling in emergency.

3. Yield the nbcon console context ownership after each record
   and block only the affected console's kthread from
   __nbcon_atomic_flush_pending_con().

   Pros:
     + Only one CPU is blocked by a slow serial console.
     + Parallel console handling is still possible in emergency.

   Cons:
     + More complex than 1 (similar to 2).
     + Possible repeated takeovers for "every" new emergency message.
Well, releasing the console context ownership after each record might
also solve some other problems [*].
I am going to try implementing the 3rd solution and see how
complicated it would be.
It would be possible to change it to the 2nd solution easily, just by
using a global counter instead (basically the sketch above) and updating
it in the emergency_enter/exit API.
[*] Andrew Murray is trying to do a similar thing with console_lock
and the legacy_kthread, see
https://lore.kernel.org/r/20250915-printk_legacy_thread_console_lock-v1-0-f34d42a9bcb3@thegoodpenguin.co.uk
He told me off-list that he has seen similar problems with nbcon_thread
as well. I am not sure, but it is likely related to
__nbcon_atomic_flush_pending_con() blocking an nbcon console context
for too long.
Best Regards,
Petr