Date:   Tue, 12 Jan 2021 16:09:53 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Mark Rutland <mark.rutland@....com>
Cc:     linux-kernel@...r.kernel.org, maz@...nel.org, peterz@...radead.org,
        tglx@...utronix.de
Subject: Re: [PATCH 0/2] irq: detect slow IRQ handlers

On Tue, Jan 12, 2021 at 01:59:48PM +0000, Mark Rutland wrote:
> Hi,
> 
> While fuzzing arm64 with Syzkaller (under QEMU+KVM) over a number of releases,
> I've occasionally seen some ridiculously long stalls (20+ seconds), where it
> appears that a CPU is stuck in a hard IRQ context. As this gets detected after
> the CPU returns to the interrupted context, it's difficult to identify where
> exactly the stall is coming from.
> 
> These patches are intended to help track this down, with a WARN() if an IRQ
> handler takes longer than a given timeout (1 second by default), logging the
> specific IRQ and handler function. While it's possible to achieve something
> similar with tracing, that's harder to integrate into an automated fuzzing
> setup.
> 
> I've been running this for a short while and haven't yet seen any of the
> stalls with this applied. I have, however, tested with smaller timeout
> periods in the 1 millisecond range by overloading the host, so I'm
> confident that the check works.
> 
> Thanks,
> Mark.

Nice!

Acked-by: Paul E. McKenney <paulmck@...nel.org>

I applied the patch below, which adds a three-second delay to the
scheduling-clock interrupt handler.  The delay executed, but your warning
was not emitted, probably because rcutorture runs under qemu/KVM.  So no
Tested-by, not yet, anyway.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e04e336..dac8c7a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2606,6 +2606,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
  */
 void rcu_sched_clock_irq(int user)
 {
+	static atomic_t invctr;
+
 	trace_rcu_utilization(TPS("Start scheduler-tick"));
 	lockdep_assert_irqs_disabled();
 	raw_cpu_inc(rcu_data.ticks_this_gp);
@@ -2623,6 +2625,14 @@ void rcu_sched_clock_irq(int user)
 		invoke_rcu_core();
 	lockdep_assert_irqs_disabled();
 
+	if (atomic_inc_return(&invctr) % 0x3ffff == 0) {
+		int i;
+
+		pr_alert("%s: 3-second delay.\n", __func__);
+		for (i = 0; i < 3000; i++)
+			udelay(1000);
+	}
+
 	trace_rcu_utilization(TPS("End scheduler-tick"));
 }
 

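------------------------------------------------------------------------

For reference, the check in Mark's patches boils down to timestamping each
handler invocation and warning when it overruns a threshold.  The sketch
below shows that pattern in isolation; the names (run_irq_handler_timed,
irq_handler_timeout_ns) and the placement are illustrative assumptions,
not code taken from the actual series.

#include <linux/interrupt.h>
#include <linux/sched/clock.h>
#include <linux/time64.h>

/* Hypothetical threshold; the series defaults to 1 second. */
static u64 irq_handler_timeout_ns = NSEC_PER_SEC;

/*
 * Illustrative wrapper: invoke one irqaction's handler, then WARN once
 * if it ran past the threshold, naming the IRQ and handler function.
 */
static irqreturn_t run_irq_handler_timed(unsigned int irq,
					 struct irqaction *action)
{
	u64 ts = sched_clock();
	irqreturn_t ret = action->handler(irq, action->dev_id);
	u64 delta = sched_clock() - ts;

	WARN_ONCE(delta > irq_handler_timeout_ns,
		  "IRQ %u handler %ps took %llu ns\n",
		  irq, action->handler, delta);
	return ret;
}

Against that kind of check, the test patch above injects a stall roughly
once every 0x3ffff scheduling-clock ticks (via a global atomic counter),
and 3000 iterations of udelay(1000) busy-wait for about three seconds,
well past a 1-second threshold.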