linux-kernel - Re: CSD lockup during kexec due to unbounded busy-wait in pl011_console_write


Message-ID: <qluelhof4piilyqbyanflp3qdljxak73kt2yvahkaby6vmyzzu@qgvqej7kdio5>
Date: Mon, 1 Dec 2025 09:04:07 -0800
From: Breno Leitao <leitao@...ian.org>
To: Petr Mladek <pmladek@...e.com>
Cc: john.ogness@...utronix.de, linux@...linux.org.uk, paulmck@...nel.org, 
	usamaarif642@...il.com, leo.yan@....com, linux-arm-kernel@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, kernel-team@...a.com, rmikey@...a.com
Subject: Re: CSD lockup during kexec due to unbounded busy-wait in
 pl011_console_write_atomic (arm64)

Hello Petr,

On Fri, Nov 28, 2025 at 05:08:17PM +0100, Petr Mladek wrote:
> On Tue 2025-11-25 08:02:16, Breno Leitao wrote:
>
> I do _not_ think that the CPU was waiting in pl011_console_write_atomic() in
> the following cycle the entire 11 secs:
> 
> 	while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
> 		cpu_relax();
> 
> A more likely scenario was that pl011_console_write_atomic() was
> called several times during this period because there were more
> pending messages.

Probably. Most of the messages come from CPUs being powered off:

	[   44.119433] psci: CPU1 killed (polled 0 ms)
	[   44.146057] psci: CPU2 killed (polled 0 ms)
	[   44.182058] psci: CPU3 killed (polled 0 ms)
	[   44.218031] psci: CPU4 killed (polled 0 ms)
	[   44.252962] psci: CPU5 killed (polled 0 ms)
	[   44.276939] psci: CPU6 killed (polled 0 ms)
	[   44.296152] psci: CPU7 killed (polled 1 ms)
	....

And this only happens on "large" machines, so the host is flushing
a lot of messages during kexec teardown.
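
As a rough back-of-the-envelope sketch of why many short lines add up on
a serial console: the helpers below are hypothetical (not kernel code) and
assume 8N1 framing, i.e. 10 bits on the wire per character. The baud rate
is a parameter because it is not stated in this thread. With an assumed
115200 baud, a 48-character line takes roughly 4 ms to drain.

```c
#include <stddef.h>

/*
 * Hypothetical helper: microseconds needed to shift `len` characters out
 * of a UART. Assumes 8N1 framing (1 start + 8 data + 1 stop = 10 bits per
 * character). Integer division truncates the result.
 */
unsigned long long uart_drain_us(size_t len, unsigned long baud)
{
	const unsigned long long bits = (unsigned long long)len * 10ULL;

	return bits * 1000000ULL / baud;
}

/*
 * Hypothetical helper: total drain time for `nr_msgs` lines of `len`
 * characters each, ignoring any per-call overhead.
 */
unsigned long long flush_total_us(size_t nr_msgs, size_t len,
				  unsigned long baud)
{
	return (unsigned long long)nr_msgs * uart_drain_us(len, baud);
}
```

For example, flush_total_us(256, 48, 115200) models 256 such lines and
comes out at about one second of pure transmit time.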

> >   printk_kthreads_shutdown (kernel/printk/printk.c:?)
> 
> But the function seems to be called with IRQs enabled, so it might
> help to restore IRQs after each flushed message.

Agreed. This would make the irq-disabled sections much smaller, with
a higher chance of IPIs and NMIs (on arm64 hosts without FEAT_NMI)
getting through.
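
To illustrate the difference, here is a toy user-space model (not the
printk code; the function names and per-record costs are made up for
illustration). It compares the longest stretch with IRQs masked when a
backlog is flushed inside one irq-disabled section against the case
where IRQs are restored after each record.

```c
#include <stddef.h>

/*
 * Longest irq-off window (in us) when the whole backlog is emitted
 * inside a single irq-disabled section: the costs simply add up.
 */
unsigned long long max_irqoff_batch_us(const unsigned long long *cost_us,
				       size_t n)
{
	unsigned long long total = 0;

	for (size_t i = 0; i < n; i++)
		total += cost_us[i];
	return total;
}

/*
 * Longest irq-off window (in us) when IRQs are restored after each
 * record: bounded by the most expensive single record.
 */
unsigned long long max_irqoff_per_record_us(const unsigned long long *cost_us,
					    size_t n)
{
	unsigned long long worst = 0;

	for (size_t i = 0; i < n; i++)
		if (cost_us[i] > worst)
			worst = cost_us[i];
	return worst;
}
```

In this model the batch window grows with the number of pending records,
while the per-record window stays at the cost of the slowest record, so a
pending IPI waits for at most one record instead of the whole backlog.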

> But we could extend the existing commit d5d399efff6577 ("printk/nbcon:
> Release nbcon consoles ownership in atomic flush after each emitted
> record") and restore IRQs after each emitted record.
> 
> I wonder if the following patch would help in this scenario.
> It is made on top of "for-next" branch in printk/linux.git.
> But the most important pre-requisite is the above mentioned commit
> in the branch "rework/atomic-flush-hardlockup".
> 
> Note that the patch is only compile tested.

I've tested the patch and I don't see the CSD lockups anymore.
Thanks for the quick fix.

> Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu
> Signed-off-by: Petr Mladek <pmladek@...e.com>

Tested-by: Breno Leitao <leitao@...ian.org>

Thanks to everyone involved here. With this last patch (which makes
the irq-disabled section smaller), and kfence not IPIing during kexec
time, I consider this issue closed.

--breno