linux-kernel - Re: [PREEMPT_RT] 8250 IRQ lockup when flooding serial console (was Re: [ANNOUNCE] v5.4.28-rt19)

Message-ID: <20200427091701.ezf4nule5hg6jziq@linutronix.de>
Date:   Mon, 27 Apr 2020 11:17:01 +0200
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Jiri Kosina <jikos@...nel.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-rt-users <linux-rt-users@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Daniel Wagner <dwagner@...e.de>
Subject: Re: [PREEMPT_RT] 8250 IRQ lockup when flooding serial console (was
 Re: [ANNOUNCE] v5.4.28-rt19)

On 2020-04-24 22:54:51 [+0200], Jiri Kosina wrote:
> It's still wonky; with the two hunks above on top of 5.6.4-rt3 (that's 
> without the PASS_LIMIT adjustment) flooding the emulated serial console 
> still emits the splat below.
…
> So now the endless interrupt storm comes at a different point -- exactly 
> once IRQs get re-enabled in prb_unlock(). How we reach prb_unlock() from 
> serial8250_tx_chars() I still have to understand. Worth involving John?

My guess is that it is unrelated: it is simply the code that happened to
be disabling/enabling interrupts at the time the NMI was triggered.

…
> [   75.286440] 000: rcu: INFO: rcu_preempt self-detected stall on CPU
> [   75.286533] 000: rcu: 	0-....: (1 GPs behind) idle=94a/1/0x4000000000000002 softirq=0/0 fqs=5167 
> [   75.286556] 000: 	(t=21000 jiffies g=15213 q=25248)

An RCU stall, but it is only one GP behind :)
My guess here would be that we simply never had the opportunity to
process the GP callbacks and nobody entered an RCU critical section
(we were busy printing on the console the whole time).

So a dummy RCU section like this:

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index f61f6f5426eff..5636123a90580 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1145,6 +1145,8 @@ static int irq_thread(void *data)
 		migrate_enable();
 #endif
 		wake_threads_waitq(desc);
+		rcu_read_lock();
+		rcu_read_unlock();
 	}

 	/*

should work, provided that RCU boosting is enabled.

…
> [  134.432670] 000: irq 4: nobody cared (try booting with the "irqpoll" option)
> [  134.432685] 000: CPU: 0 PID: 1209 Comm: irq/4-ttyS0 Not tainted 5.6.4-rt19-00003-g5cf51e8702ad #16
> [  134.432690] 000: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014

Yeah, again. I'm not sure if this is good or bad. The threaded
IRQ handler runs constantly in this scenario. The core code *thinks*
that the handler makes no progress or is stuck, and so it disables the
interrupt. That is not so far-fetched. It wouldn't happen on real
hardware: actual HW would take more time, so we would not be "stuck" in
the handler endlessly.
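
To illustrate why the core code reaches that conclusion, here is a
minimal userspace sketch of the counting idea behind "nobody cared". It
is only a model: struct irq_model, note_interrupt_model() and feed() are
hypothetical names, and the window of 100000 interrupts with a limit of
99900 unhandled ones is my reading of kernel/irq/spurious.c, not a copy
of it. The real code also has time-based resetting of the unhandled
counter and separate bookkeeping for threaded handlers, both left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-IRQ bookkeeping. */
struct irq_model {
	unsigned int irq_count;      /* interrupts seen in the current window */
	unsigned int irqs_unhandled; /* of those, how many nobody claimed */
	bool disabled;               /* line switched off by the detector */
};

/* Assumed thresholds, see the note above. */
enum { WINDOW = 100000, UNHANDLED_LIMIT = 99900 };

void irq_model_init(struct irq_model *d)
{
	assert(d != NULL);
	d->irq_count = 0;
	d->irqs_unhandled = 0;
	d->disabled = false;
}

/*
 * Account for one interrupt. Returns true if this call disabled the
 * line because almost every interrupt in the window went unhandled.
 */
bool note_interrupt_model(struct irq_model *d, bool handled)
{
	assert(d != NULL);

	if (d->disabled)
		return false;

	if (!handled)
		d->irqs_unhandled++;

	/* Only evaluate once per full window of interrupts. */
	if (++d->irq_count < WINDOW)
		return false;

	d->irq_count = 0;
	if (d->irqs_unhandled > UNHANDLED_LIMIT) {
		d->disabled = true;
		d->irqs_unhandled = 0;
		return true;
	}

	d->irqs_unhandled = 0;
	return false;
}

/* Feed n identical interrupts; returns whether the line is disabled. */
bool feed(struct irq_model *d, unsigned int n, bool handled)
{
	for (unsigned int i = 0; i < n; i++)
		note_interrupt_model(d, handled);
	return d->disabled;
}
```

In this model a line is only switched off when a full window is
dominated by interrupts that look unhandled, which is what an emulated
UART that re-raises the interrupt immediately can provoke.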

In networking, we would have NAPI, which then pushes the driver to
ksoftirqd, which runs at SCHED_OTHER, while here the IRQ thread runs at
an RT priority. I don't think we should add something like this to the
8250 driver to deal with the situation.

Sebastian