Message-ID: <aNQO-zl3k1l4ENfy@pathway.suse.cz>
Date: Wed, 24 Sep 2025 17:32:11 +0200
From: Petr Mladek <pmladek@...e.com>
To: John Ogness <john.ogness@...utronix.de>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Jiri Slaby <jirislaby@...nel.org>,
Sergey Senozhatsky <senozhatsky@...omium.org>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Esben Haabendal <esben@...nix.com>, linux-serial@...r.kernel.org,
linux-kernel@...r.kernel.org,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
Arnd Bergmann <arnd@...db.de>, Tony Lindgren <tony@...mide.com>,
Niklas Schnelle <schnelle@...ux.ibm.com>,
Serge Semin <fancer.lancer@...il.com>,
Andrew Murray <amurray@...goodpenguin.co.uk>
Subject: Re: [RFC 0/1] serial: 8250: nbcon_atomic_flush_pending() might
trigger watchdog warnings
Added Andrew Murray to Cc.

On Wed 2025-09-24 14:48:12, John Ogness wrote:
> On 2025-09-24, Petr Mladek <pmladek@...e.com> wrote:
> > I tried to implement some naive solution, see below. I think that
> > it can be described as the 2nd layer of ownership.
> >
> > It does not look that complicated. It might be because I used it only
> > in two paths which do the writing. And it is possible that it does
> > not work properly.
>
> It is an interesting approach and is one that I tend to prefer.
>
> > Then I got another idea. It might be even easier.
> >
> > I think that it might actually be enough to block the kthread when
> > any CPU is in emergency context, for example, by using a global
> > atomic counter.
>
> This is the quick idea that usually comes first. Basically introducing
> an emergency_in_progress() to keep the kthreads out. My problem with
> this is that it causes _all_ consoles to synchronize with each other,
> which we worked hard to get away from. Yes, I realize Linus rejected the
> "store the backtrace, then print it" implementation, which limits the
> effectiveness of parallel printing. Nevertheless, I would prefer to work
> towards returning to parallelization, rather than reducing it further.
Let me play the devil's advocate for a while.
IMHO, keeping the kthreads running in parallel for a synchronous
emergency report opens a can of worms.
Yes, the kthreads might be fast enough when the emergency context
is busy with a slow console. But they might also cause repeated
takeovers for "every" message when a kthread starts emitting
each new message just to lose the ownership after a few characters.
Honestly, blocking the kthreads during an emergency does not look
that bad to me.
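
Just to show what I mean by blocking them, here is a completely
untested sketch with invented names. The counter would be updated
from the emergency_enter/exit API and the printing kthreads would
back off while it is non-zero:

static atomic_t nbcon_emergency_cnt = ATOMIC_INIT(0);

static inline void nbcon_emergency_cnt_inc(void)
{
	atomic_inc(&nbcon_emergency_cnt);
}

static inline void nbcon_emergency_cnt_dec(void)
{
	atomic_dec(&nbcon_emergency_cnt);
}

/* Checked by the printing kthreads before they take the console context. */
static inline bool nbcon_emergency_in_progress(void)
{
	return atomic_read(&nbcon_emergency_cnt) > 0;
}

The kthreads would also need to get woken up, or re-check, once the
counter drops back to zero, but that is a detail.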
> > I am not sure if you have already started working on it. I would rather share
> > my naive ideas early so that I do not duplicate the effort.
> > It is possible that you have been there.
>
> Thanks. Yes, I tried various ideas like this. But your version does go
> about it differently. Although I am seeing the same problems. My
> comments below.
>
> > +int nbcon_context_put_flush_prio(struct nbcon_write_context *wctxt)
> > +{
> > +	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
> > +	struct console *con = ctxt->console;
> > +	int ret = 0;
> > +
> > +	if (!nbcon_context_try_acquire(ctxt, false))
> > +		return -EPERM;
>
> This is my main concern. If a context has set the flush_prio to
> something higher, it _MUST_ set it back down later. This cannot be best
> effort. A failed acquire does not mean that a context with the same
> priority is the owner (for example, it could be a NORMAL context that is
> in an unsafe section). Remember that there are owners that are not
> printers.
>
> So now we have a similar situation as nbcon_reacquire_nobuf(): we need
> to regain ownership so that we can undo something we setup. And that is
> the problem we are trying to solve in the first place. Maybe since this
> moves the problem from NORMAL to EMERGENCY, the busy-waiting is
> acceptable.
>
> As you mentioned, there is the problem that the flushing context could
> change hands multiple times before the flush_prio is restored. And there
> is also the recursive case where a WARN could happen within a WARN, or a
> WARN could happen and then in an NMI another WARN. All these cases
> probably mean that it needs to be a per-prio counter rather than simply
> a single prio so that each context can increment/decrement their prio.
Thanks a lot for showing all the problems.
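
Just to be sure we mean the same thing by a per-prio counter, I imagine
something like this (untested, invented names; shown as a global array,
but it could as well live in struct console):

static atomic_t flush_prio_cnt[NBCON_PRIO_MAX];

static void flush_prio_get(enum nbcon_prio prio)
{
	atomic_inc(&flush_prio_cnt[prio]);
}

static void flush_prio_put(enum nbcon_prio prio)
{
	atomic_dec(&flush_prio_cnt[prio]);
}

/* The effective priority is the highest level with a non-zero counter. */
static enum nbcon_prio flush_prio_effective(void)
{
	int prio;

	for (prio = NBCON_PRIO_MAX - 1; prio > NBCON_PRIO_NONE; prio--) {
		if (atomic_read(&flush_prio_cnt[prio]))
			return prio;
	}

	return NBCON_PRIO_NONE;
}

Each context would increment the counter for its priority when it raises
the flush priority and decrement it when it is done. Nested WARNs, or a
WARN inside an NMI, would then just see a higher count.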
It might be easier if we combine it with the first approach. I mean,
block only the console-specific kthread, from
__nbcon_atomic_flush_pending_con(). In other words, block
the kthread from __nbcon_atomic_flush_pending_con() instead of from
emergency_enter(), and store the counter in struct console.
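
A completely untested sketch of what I mean; the counter field in
struct console is invented (let's call it nbcon_prio_flush_cnt):

static void nbcon_prio_flush_get(struct console *con)
{
	atomic_inc(&con->nbcon_prio_flush_cnt);
}

static void nbcon_prio_flush_put(struct console *con)
{
	atomic_dec(&con->nbcon_prio_flush_cnt);
}

/* The per-console kthread would back off while this returns true. */
static bool nbcon_prio_flush_in_progress(struct console *con)
{
	return atomic_read(&con->nbcon_prio_flush_cnt) > 0;
}

__nbcon_atomic_flush_pending_con() would increment the counter before
flushing, release and re-acquire the console context between records,
and decrement the counter at the end. Only the kthread of the affected
console would be blocked.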
Summary:

We currently have the following solutions for the original
problem (hardlockup in nbcon_reacquire_nobuf()):

1. Touch the watchdog in nbcon_reacquire_nobuf().

   Pros:
     + trivial

   Cons:
     + Two CPUs might be blocked by slow serial consoles.

2. Yield the nbcon console context ownership after each record
   and block all kthreads via the emergency_enter/exit API.

   Pros:
     + Only one CPU is blocked by a slow serial console.
     + Prevents repeated takeovers for "every" new message.

   Cons:
     + More complex than 1.
     + Completely gives up on parallel console handling in emergency.

3. Yield the nbcon console context ownership after each record
   and block only the affected console's kthread from
   __nbcon_atomic_flush_pending_con().

   Pros:
     + Only one CPU is blocked by a slow serial console.
     + Parallel console handling is still possible in emergency.

   Cons:
     + More complex than 1 (similar to 2).
     + Possible repeated takeovers for "every" new emergency message.
Well, releasing the console context ownership after each record might
also solve some other problems [*].
I am going to try implementing the 3rd solution and see how
complicated it would be.
It would be possible to change it to the 2nd solution easily, just by
using a global counter instead (basically the sketch above) and updating
it in the emergency_enter/exit API.
[*] Andrew Murray is trying to do a similar thing with console_lock
and the legacy_kthread, see
https://lore.kernel.org/r/20250915-printk_legacy_thread_console_lock-v1-0-f34d42a9bcb3@thegoodpenguin.co.uk
He told me off-list that he has seen similar problems with nbcon_thread
as well. I am not sure, but it is likely related to
__nbcon_atomic_flush_pending_con() blocking an nbcon console context
for too long.
Best Regards,
Petr