lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <nycvar.YFH.7.76.2004232141590.19713@cbobk.fhfr.pm>
Date:   Thu, 23 Apr 2020 21:48:20 +0200 (CEST)
From:   Jiri Kosina <jikos@...nel.org>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
cc:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-rt-users <linux-rt-users@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Daniel Wagner <dwagner@...e.de>
Subject: Re: [PREEMPT_RT] 8250 IRQ lockup when flooding serial console (was
 Re: [ANNOUNCE] v5.4.28-rt19)

On Thu, 23 Apr 2020, Sebastian Andrzej Siewior wrote:

> The IRQ4 is edge and in charge of ttyS0. It is handled by 
> handle_edge_irq() and after ->irq_ack(), the thread is woken up and then 
> we get another ->handle_edge_irq() for IRQ4. With larger PASS_LIMIT the 
> thread runs longer so note_interrupt() will make less IRQ_HANDLED based 
> on ->threads_handled_last. If it observes 100 handled within 100000 
> interrupts then the counters are reset again. On !RT it usually manages 
> to get >100 per 100000 interrupts so it appears good. On RT it gets less 
> and the interrupt gets disabled.
> 
> So it is not RT related, but RT triggers it more reliably (also the 
> PASS_LIMIT change can vanish). I can't tell if this is a qemu bug in 
> emulating the HW or not. I can't reproduce it real HW. I see a second 
> edge interrupt only after the thread completed. I can't tell if this is 
> because it is a real UART and the data is flowing slower or because the 
> edge-IRQ is not triggered repeatedly.

Yeah, it's all strange. In the hope of understanding the issue a little 
bit better, I tried to disable IRQs in serial8250_handle_irq() to mimic 
what !PREEMPT_RT spinlock would do; the idea was that this is some kind of 
strange race / memory ordering (missed ack?) between the threaded irq4 
handler and the do_IRQ() -> handle_edge_irq() -> ... path.

So I did this:

---
 drivers/tty/serial/8250/8250_port.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index e31217e8dce6..1a577305e174 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1813,12 +1813,13 @@ static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
 int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 {
 	unsigned char status;
-	unsigned long flags;
+	unsigned long flags, f;
 	struct uart_8250_port *up = up_to_u8250p(port);
 
 	if (iir & UART_IIR_NO_INT)
 		return 0;
 
+	local_irq_save(f);
 	spin_lock_irqsave(&port->lock, flags);
 
 	status = serial_port_in(port, UART_LSR);
@@ -1833,6 +1834,7 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 		serial8250_tx_chars(up);
 
 	uart_unlock_and_check_sysrq(port, flags);
+	local_irq_restore(f);
 	return 1;
 }

But curiously enough, that exploded in the oposite order (so first there 
was CPU stall, and only later the disabling of IRQ4 due to spurious 
storm):

[  107.308525] 000: rcu: INFO: rcu_preempt self-detected stall on CPU
[  107.308542] 000: rcu: 	0-....: (20985 ticks this GP) idle=b2a/1/0x4000000000000002 softirq=0/0 fqs=4841 
[  107.308563] 000: 	(t=21000 jiffies g=16217 q=22343)
[  107.308565] 000: NMI backtrace for cpu 0
[  107.308570] 000: CPU: 0 PID: 1213 Comm: irq/4-ttyS0 Not tainted 5.6.4-rt19+ #11
[  107.308577] 000: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
[  107.308579] 000: Call Trace:
[  107.308599] 000:  <IRQ>
[  107.308608] 000:  dump_stack+0x50/0x6b
[  107.308630] 000:  nmi_cpu_backtrace+0x8a/0x90
[  107.308639] 000:  ? lapic_can_unplug_cpu+0xd0/0xd0
[  107.308644] 000:  nmi_trigger_cpumask_backtrace+0xe9/0x130
[  107.308647] 000:  rcu_dump_cpu_stacks+0x8e/0xbc
[  107.308658] 000:  rcu_sched_clock_irq+0x70a/0xad0
[  107.308661] 000:  ? cpuacct_account_field+0x16/0x70
[  107.308667] 000:  ? tick_sched_do_timer+0x60/0x60
[  107.308675] 000:  update_process_times+0x1f/0x40
[  107.308680] 000:  tick_sched_handle+0x2f/0x40
[  107.308685] 000:  tick_sched_timer+0x3e/0x80
[  107.308687] 000:  __hrtimer_run_queues+0x11e/0x3c0
[  107.308691] 000:  hrtimer_interrupt+0xea/0x250
[  107.308694] 000:  smp_apic_timer_interrupt+0x84/0x200
[  107.308704] 000:  apic_timer_interrupt+0xf/0x20
[  107.308709] 000:  </IRQ>
[  107.308713] 000: RIP: 0010:serial8250_handle_irq.part.26+0x9e/0xd0
[  107.308719] 000: Code: 48 85 c0 74 09 80 b8 e9 00 00 00 00 74 0f 41 83 e5 20 74 09 f6 83 36 02 00 00 02 75 2d 31 f6 48 89 df e8 65 87 ff ff 41 54 9d <b8> 01 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 40 08 48 89 df e8
[  107.308723] 000: RSP: 0018:ffffb6ffc0a63e20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  107.308726] 000: RAX: ffff90b23734e040 RBX: ffffffff9f5a0de0 RCX: 0000000000000000
[  107.308728] 000: RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffffff9f5a0de0
[  107.308729] 000: RBP: 0000000000000060 R08: ffffffff9ea0b7a0 R09: ffffb6ffc0a63d78
[  107.308730] 000: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000246
[  107.308731] 000: R13: 0000000000000020 R14: 00000000000000c2 R15: ffff90b2ba711e58
[  107.308739] 000:  serial8250_default_handle_irq+0x45/0x50
[  107.308741] 000:  serial8250_interrupt+0x51/0xa0
[  107.308749] 000:  ? irq_finalize_oneshot.part.46+0xd0/0xd0
[  107.308755] 000:  irq_forced_thread_fn+0x2b/0x70
[  107.308758] 000:  irq_thread+0xd8/0x170
[  107.308760] 000:  ? wake_threads_waitq+0x30/0x30
[  107.308761] 000:  ? irq_thread_dtor+0x90/0x90
[  107.308763] 000:  kthread+0x10e/0x130
[  107.308769] 000:  ? kthread_park+0x80/0x80
[  107.308770] 000:  ret_from_fork+0x35/0x40
[  165.360370] 000: irq 4: nobody cared (try booting with the "irqpoll" option)
[  165.360392] 000: CPU: 0 PID: 1213 Comm: irq/4-ttyS0 Not tainted 5.6.4-rt19+ #11
[  165.360398] 000: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
[  165.360599] 000: Call Trace:
[  165.361639] 000:  <IRQ>
[  165.361651] 000:  dump_stack+0x50/0x6b
[  165.361670] 000:  __report_bad_irq+0x2b/0xb0
[  165.361678] 000:  note_interrupt+0x22e/0x280
[  165.361680] 000:  ? serial8250_handle_irq.part.26+0x9e/0xd0
[  165.361685] 000:  handle_irq_event_percpu+0x6e/0x90
[  165.361693] 000:  handle_irq_event+0x48/0x90
[  165.361695] 000:  handle_edge_irq+0x95/0x1f0
[  165.361700] 000:  do_IRQ+0x6c/0x120
[  165.361709] 000:  common_interrupt+0xf/0xf
[  165.361719] 000:  </IRQ>
[  165.361720] 000: RIP: 0010:serial8250_handle_irq.part.26+0x9e/0xd0
[  165.361727] 000: Code: 48 85 c0 74 09 80 b8 e9 00 00 00 00 74 0f 41 83 e5 20 74 09 f6 83 36 02 00 00 02 75 2d 31 f6 48 89 df e8 65 87 ff ff 41 54 9d <b8> 01 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 40 08 48 89 df e8
[  165.361731] 000: RSP: 0018:ffffb6ffc0a63e20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb
[  165.361732] 000: RAX: ffff90b23734e040 RBX: ffffffff9f5a0de0 RCX: 0000000000000001
[  165.361736] 000: RDX: 0000000000000003 RSI: 0000000000000097 RDI: 00000000ffffffff
[  165.361737] 000: RBP: 0000000000000060 R08: ffffffff9ea0b7a0 R09: ffffb6ffc0a63d78
[  165.361739] 000: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000246
[  165.361740] 000: R13: 0000000000000020 R14: 00000000000000c2 R15: ffff90b2ba711e58
[  165.361749] 000:  serial8250_default_handle_irq+0x45/0x50
[  165.361752] 000:  serial8250_interrupt+0x51/0xa0
[  165.361759] 000:  ? irq_finalize_oneshot.part.46+0xd0/0xd0
[  165.361762] 000:  irq_forced_thread_fn+0x2b/0x70
[  165.361767] 000:  irq_thread+0xd8/0x170
[  165.361771] 000:  ? wake_threads_waitq+0x30/0x30
[  165.361773] 000:  ? irq_thread_dtor+0x90/0x90
[  165.361775] 000:  kthread+0x10e/0x130
[  165.361780] 000:  ? kthread_park+0x80/0x80
[  165.361782] 000:  ret_from_fork+0x35/0x40
[  165.361787] 000: handlers:
[  165.361793] 000: [<00000000005f10ab>] irq_default_primary_handler threaded [<00000000853a4a7e>] serial8250_interrupt
[  165.361795] 000: Disabling IRQ #4

So this time the storm + stall apparently was triggered by the 'popf' 
coming from my added local_irq_restore().

It's really curious.

I'd be happy to blame it on some qemu breakage, but it seems to be 
happening on any randomly chosen version so far.

Thanks,

-- 
Jiri Kosina
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ