[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aEGeARVcCwqcoHb8@pathway.suse.cz>
Date: Thu, 5 Jun 2025 15:39:44 +0200
From: Petr Mladek <pmladek@...e.com>
To: "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com>
Cc: John Ogness <john.ogness@...utronix.de>,
'Michael Kelley' <mhklinux@...look.com>,
'Ryo Takakura' <ryotkkr98@...il.com>,
Russell King <linux@...linux.org.uk>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Jiri Slaby <jirislaby@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-serial@...r.kernel.org" <linux-serial@...r.kernel.org>,
"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>
Subject: Re: Problem with nbcon console and amba-pl011 serial port
On Thu 2025-06-05 05:27:56, Toshiyuki Sato (Fujitsu) wrote:
> Hi John and Petr,
>
> > On Wed 2025-06-04 13:56:34, John Ogness wrote:
> > > On 2025-06-04, Petr Mladek <pmladek@...e.com> wrote:
> > > > On Wed 2025-06-04 04:11:10, Toshiyuki Sato (Fujitsu) wrote:
> > > >> > On 2025-06-03, John Ogness <john.ogness@...utronix.de> wrote:
> > > >> > > On 2025-06-03, "Toshiyuki Sato (Fujitsu)" <fj6611ie@...itsu.com> wrote:
> > > >> > >>> 4. pr_emerg() has a high logging level, and it effectively steals the console
> > > >> > >>> from the "pr/ttyAMA0" task, which I believe is intentional in the nbcon
> > > >> > design.
> > > >> > >>> Down in pl011_console_write_thread(), the "pr/ttyAMA0" task is doing
> > > >> > >>> nbcon_enter_unsafe() and nbcon_exit_unsafe() around each character
> > > >> > >>> that it outputs. When pr_emerg() steals the console, nbcon_exit_unsafe()
> > > >> > >>> returns 0, so the "for" loop exits. pl011_console_write_thread() then
> > > >> > >>> enters a busy "while" loop waiting to reclaim the console. It's doing this
> > > >> > >>> busy "while" loop with interrupts disabled, and because of the panic,
> > > >> > >>> it never succeeds.
> > > >
> > > > I am a bit surprised that it never succeeds. The panic CPU takes over
> > > > the ownership but it releases it when the messages are flushed. And
> > > > the original owner should be able to reacquire it in this case.
> > >
> > > The problem is that other_cpu_in_panic() will return true forever, which
> > > will cause _all_ acquires to fail forever. Originally we did allow
> > > non-panic to take over again after panic releases ownership. But IIRC we
> > > removed that capability because it allowed us to reduce a lot of
> > > complexity. And now nbcon_waiter_matches() relies on "Lower priorities
> > > are ignored during panic() until reboot."
> >
> > Great catch! I forgot it. And it explains everything.
> >
> > It would be nice to mention this in the commit message or
> > in the comment above nbcon_reacquire_nobuf().
> >
> > My updated prosal of the comment is:
> >
> > * Return: True when the context reacquired the owner ship. The caller
> > * might try entering the unsafe state and restore the original
> > * console device setting. It must not access the output buffer
> > * anymore.
> > *
> > * False when another CPU is in panic(). nbcon_try_acquire()
> > * would never succeed and the infinite loop would prevent
> > * stopping this CPU on architectures without proper NMI.
> > * The caller should bail out immediately without
> > * touching the console device or the output buffer.
> >
> > Best Regards,
> > Petr
>
> Thank you for your comments and suggestions.
>
> After consideration,
> I believe that there is no problem with forcibly terminating the process when
> nbcon_reacquire_nobuf returns false at the pl011 driver level,
> as in the test patch.
>
> It feels a bit harsh that a thread which started processing before the panic
> and then transferred ownership to an atomic operation isn't allowed to perform
> cleanup during panic handling or the grace period before the CPU halts.
>
> I would like to hear your opinion on this.
> If nbcon_reacquire_nobuf() could acquire ownership even after the panic,
> then driver-side modifications might not be necessary.
> (The responsibility to complete the process without hindering the panic process
> would still remain.)
>
> Would it be difficult to make an exception to the rule,
> "Lower priorities are ignored during panic() until reboot,"
> depending on the situation?
Good question.
The following two problems came to my mind:
1. As John, pointed out, the fact that non-panic CPUs are not
able to acquire the context allowed to simplify the implementation.
I think that it is primary about nbcon_waiter_matches(),
nbcon_owner_matches(), and the related logic. It was documented by
the commit 8c9dab2c55ad7 ("printk: nbcon: Clarify rules of
the owner/waiter matching").
But it seems that nbcon_owner_matches() is safe even without the rule.
The race is prevented either by disabling interrupts and preemption
or by taking device_lock().
The rule prevents a race in nbcon_waiter_matches(). But it seems
that in the worst case, more CPUs might end up busy waiting.
And it would be acceptable during panic().
So, this need not be a big problem in the end.
2. If we allowed non-panic() CPUs to acquire the ownership, it would
increase the risk that the panic CPU will not be able to
flush the messages.
But maybe, the problem is only when the architecture supports
proper NMI and non-panic CPUs might be stopped anywhere.
It should be less problem on architectures without proper NMI
where the non-panic CPU could not be stopped in the problematic
situation.
So, maybe, we could relax the rule on architectures without
proper NMI.
The question is if it is worth it. Is the clean up really important?
Note that the clean up will never be guaranteed on architectures with
a proper NMI. They would stop the non-panic CPUs, including the printk
kthread, anywhere.
And I guess that the console devices will be initialized after the
reboot anyway.
Best Regards,
Petr
Powered by blists - more mailing lists