linux-kernel - Re: Regression in RCU subsystem in latest mainline kernel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20130626141617.GJ3828@linux.vnet.ibm.com>
Date:	Wed, 26 Jun 2013 07:16:17 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Michael Ellerman <michael@...erman.id.au>
Cc:	linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
	Rojhalat Ibrahim <imr@...chenk.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	linux-kernel@...r.kernel.org
Subject: Re: Regression in RCU subsystem in latest mainline kernel

On Wed, Jun 26, 2013 at 06:10:58PM +1000, Michael Ellerman wrote:
> On Tue, Jun 25, 2013 at 09:03:32AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 25, 2013 at 05:44:23PM +1000, Michael Ellerman wrote:
> > > On Tue, Jun 25, 2013 at 05:19:14PM +1000, Michael Ellerman wrote:
> > > > 
> > > > Here's another trace from 3.10-rc7 plus a few local patches.
> > > 
> > > And here's another with CONFIG_RCU_CPU_STALL_INFO=y in case that's useful:
> > > 
> > > PASS running test_pmc5_6_overuse()
> > > INFO: rcu_sched self-detected stall on CPU
> > > 	8: (1 GPs behind) idle=8eb/140000000000002/0 softirq=215/220 
> > 
> > So this CPU has been out of action since before the beginning of the
> > current grace period ("1 GPs behind").  It is not idle, having taken
> > a pair of nested interrupts from process context (matching the stack
> > below).  This CPU has take five softirqs since the last grace period
> > that it noticed, which makes it likely that the loop is within the
> > softirq handler.
> > 
> > > 	 (t=2100 jiffies g=18446744073709551583 c=18446744073709551582 q=13)
> > 
> > Assuming HZ=100, this stall has been going on  for 21 seconds.  There
> > is a grace period in progress according to RCU's global state (which
> > this CPU is not yet aware of).  There are a total of 13 RCU callbacks
> > queued across the entire system.
> > 
> > If the system is at all responsive, I suggest using ftrace (either from
> > the boot command line or at runtime) to trace __do_softirq() and
> > hrtimer_interrupt().
> 
> Thanks for decoding it Paul.
> 
> I've narrowed down the test case and I think this is probably just a
> case of too many perf interrupts. If I reduce the sampling period by
> half the test runs fine.
> 
> There is logic in perf to detect an interrupt storm, but for some reason
> it's not saving us. I'll dig in there, but I don't think it's an RCU
> problem.

Whew!  ;-)

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/