Date:	Thu, 6 Jun 2013 13:55:14 -0700
From:	Tejun Heo <tj@...nel.org>
To:	Ben Greear <greearb@...delatech.com>
Cc:	Eric Dumazet <eric.dumazet@...il.com>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Joe Lawrence <joe.lawrence@...atus.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	stable@...r.kernel.org,
	"Luis R. Rodriguez" <mcgrof@....qualcomm.com>,
	Jouni Malinen <jouni@....qualcomm.com>,
	Vasanthakumar Thiagarajan <vthiagar@....qualcomm.com>,
	Senthil Balasubramanian <senthilb@....qualcomm.com>,
	linux-wireless@...r.kernel.org, ath9k-devel@...ema.h4ckr.net,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: stop_machine lockup issue in 3.9.y.

Hello, Ben.

On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> >On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >>Ah, so, that's why it's showing up now.  We probably have had the same
> >>issue all along but it used to be masked by the softirq limiting.  Do
> >>you care to revive the 10 iterations limit so that it's limited by
> >>both the count and timing?  We do wanna find out why softirq is
> >>spinning indefinitely tho.
> >
> >Yes, no problem, I can do that.
> 
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.

First of all, kudos for tracking the issue down.  While removing the
looping limit in softirq handling was the direct cause of the problem
becoming visible, it's very bothersome that we have a softirq runaway.
Finding the perpetrator shouldn't be hard.  Something like the
following should work (untested).  Once we know which softirq it is
(probably the network one), we can dig deeper.

Thanks.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b5197dc..5af3682 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	int cnt = 0;
 
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -244,6 +245,9 @@ restart:
 			kstat_incr_softirqs_this_cpu(vec_nr);
 
 			trace_softirq_entry(vec_nr);
+			if (++cnt >= 5000 && cnt < 5010)
+				printk("XXX __do_softirq: stuck handling softirqs, cnt=%d action=%pf\n",
+				       cnt, h->action);
 			h->action(h);
 			trace_softirq_exit(vec_nr);
 			if (unlikely(prev_count != preempt_count())) {
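
For anyone who wants to see the counting trick outside the kernel, here is
a rough userspace sketch of the same idea: count handler invocations and
print only while the count sits in a small window, so a runaway handler is
named without flooding the log.  Everything in it is made up for
illustration (raise_vec(), run_pending(), the handler table, the
max_restart bound); none of it is kernel API, and it is not the patch
above.

```c
#include <assert.h>
#include <stdio.h>

#define NR_VECS    3     /* hypothetical number of handler slots */
#define WARN_START 5000  /* first invocation that gets reported */
#define WARN_END   5010  /* reporting stops here: at most 10 lines */

unsigned int pending;      /* bitmask of raised slots */
int warnings;              /* diagnostic lines printed so far */
int last_warned_vec = -1;  /* slot named by the latest diagnostic */

void raise_vec(int nr)
{
	assert(nr >= 0 && nr < NR_VECS);
	pending |= 1u << nr;
}

/* A well-behaved handler: does its work and returns. */
void quiet_action(int nr)
{
	(void)nr;
}

/* A runaway handler: re-raises itself every time it runs. */
void runaway_action(int nr)
{
	raise_vec(nr);
}

void (*const actions[NR_VECS])(int) = {
	quiet_action, runaway_action, quiet_action,
};

/*
 * Run raised handlers, rescanning at most max_restart times.
 * Returns the total number of handler invocations.
 */
int run_pending(int max_restart)
{
	int cnt = 0;
	int restarts = 0;

	while (pending && restarts < max_restart) {
		unsigned int snapshot = pending;

		pending = 0;
		restarts++;
		for (int nr = 0; nr < NR_VECS; nr++) {
			if (!(snapshot & (1u << nr)))
				continue;
			/* Report only inside the window to avoid flooding. */
			if (++cnt >= WARN_START && cnt < WARN_END) {
				printf("XXX run_pending: stuck, cnt=%d vec=%d\n",
				       cnt, nr);
				warnings++;
				last_warned_vec = nr;
			}
			actions[nr](nr);
		}
	}
	return cnt;
}
```

Raising slots 0 and 1 and then calling run_pending(6000) should print ten
lines, all naming slot 1, which is the slot that keeps re-raising itself.
The max_restart argument stands in for a count-based bound on rescans; the
sketch has no time-based bound.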


-- 
tejun