Message-ID: <20130626081057.GB10796@concordia>
Date:	Wed, 26 Jun 2013 18:10:58 +1000
From:	Michael Ellerman <michael@...erman.id.au>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
	Rojhalat Ibrahim <imr@...chenk.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	linux-kernel@...r.kernel.org
Subject: Re: Regression in RCU subsystem in latest mainline kernel

On Tue, Jun 25, 2013 at 09:03:32AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 25, 2013 at 05:44:23PM +1000, Michael Ellerman wrote:
> > On Tue, Jun 25, 2013 at 05:19:14PM +1000, Michael Ellerman wrote:
> > > 
> > > Here's another trace from 3.10-rc7 plus a few local patches.
> > 
> > And here's another with CONFIG_RCU_CPU_STALL_INFO=y in case that's useful:
> > 
> > PASS running test_pmc5_6_overuse()
> > INFO: rcu_sched self-detected stall on CPU
> > 	8: (1 GPs behind) idle=8eb/140000000000002/0 softirq=215/220 
> 
> So this CPU has been out of action since before the beginning of the
> current grace period ("1 GPs behind").  It is not idle, having taken
> a pair of nested interrupts from process context (matching the stack
> below).  This CPU has taken five softirqs since the last grace period
> that it noticed, which makes it likely that the loop is within the
> softirq handler.
> 
> > 	 (t=2100 jiffies g=18446744073709551583 c=18446744073709551582 q=13)
> 
> Assuming HZ=100, this stall has been going on for 21 seconds.  There
> is a grace period in progress according to RCU's global state (which
> this CPU is not yet aware of).  There are a total of 13 RCU callbacks
> queued across the entire system.
> 
> If the system is at all responsive, I suggest using ftrace (either from
> the boot command line or at runtime) to trace __do_softirq() and
> hrtimer_interrupt().

Thanks for decoding it, Paul.
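
For reference, the arithmetic behind that reading can be sketched in a
few lines of Python. This is only an illustration of the numbers in the
two report lines quoted above, not anything the kernel provides: the
function name decode_stall(), the regular expressions and the returned
field names are made up for this example, and HZ=100 is the same
assumption Paul made. It treats softirq=A/B as "count when this CPU last
noted a grace period / count now", and g != c as "a grace period is in
progress", which matches Paul's explanation.

```python
import re

# The two report lines exactly as they appear in the trace quoted above.
CPU_LINE = "8: (1 GPs behind) idle=8eb/140000000000002/0 softirq=215/220"
GP_LINE = "(t=2100 jiffies g=18446744073709551583 c=18446744073709551582 q=13)"

HZ = 100  # assumption, as in Paul's reply; check CONFIG_HZ for the real value


def decode_stall(cpu_line, gp_line, hz=HZ):
    """Pull out the fields discussed in the thread (hypothetical helper)."""
    cpu = re.search(r"(\d+): \((\d+) GPs behind\).*softirq=(\d+)/(\d+)", cpu_line)
    gp = re.search(r"t=(\d+) jiffies g=(\d+) c=(\d+) q=(\d+)", gp_line)
    if cpu is None or gp is None:
        raise ValueError("unrecognised stall report format")

    cpu_id, behind, softirq_then, softirq_now = map(int, cpu.groups())
    t, g, c, q = map(int, gp.groups())

    return {
        "cpu": cpu_id,
        "gps_behind": behind,
        # softirqs handled since the last grace period this CPU noticed
        "softirqs_since_gp": softirq_now - softirq_then,
        # stall duration: jiffies divided by ticks per second
        "stall_seconds": t / hz,
        # started and completed grace-period numbers differ
        "gp_in_progress": g != c,
        "callbacks_queued": q,
    }
```

Calling decode_stall(CPU_LINE, GP_LINE) gives five softirqs since the
last noticed grace period and a stall of 21.0 seconds, with a grace
period in progress and 13 callbacks queued. With a different HZ the
duration changes accordingly.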

I've narrowed down the test case and I think this is probably just a
case of too many perf interrupts. If I reduce the sampling period by
half, the test runs fine.

There is logic in perf to detect an interrupt storm, but for some reason
it's not saving us. I'll dig in there, but I don't think it's an RCU
problem.
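
If the tracing Paul suggested turns out to be needed after all, a
runtime setup could look roughly like the sketch below. It assumes the
function tracer is built in and that debugfs is mounted at
/sys/kernel/debug (adjust TRACING_DIR otherwise); the helper names
start_function_trace() and read_trace() are invented for this example.
It only writes to the standard ftrace control files set_ftrace_filter,
current_tracer and tracing_on, and needs root on a real system. The
boot-time equivalent would be something like
ftrace=function ftrace_filter=__do_softirq,hrtimer_interrupt on the
kernel command line.

```python
import os

TRACING_DIR = "/sys/kernel/debug/tracing"  # assumes debugfs is mounted here
FUNCTIONS = ["__do_softirq", "hrtimer_interrupt"]  # the two Paul named


def start_function_trace(functions=FUNCTIONS, tracing_dir=TRACING_DIR):
    """Limit the function tracer to the given functions, then enable it."""

    def write(name, value):
        # Opening with "w" truncates, which also clears any previous filter.
        with open(os.path.join(tracing_dir, name), "w") as f:
            f.write(value)

    write("tracing_on", "0")                         # pause while reconfiguring
    write("set_ftrace_filter", " ".join(functions))  # trace only these functions
    write("current_tracer", "function")              # select the function tracer
    write("tracing_on", "1")                         # start recording


def read_trace(tracing_dir=TRACING_DIR):
    """Return the current contents of the trace buffer as text."""
    with open(os.path.join(tracing_dir, "trace")) as f:
        return f.read()
```

Run start_function_trace() before kicking off the test, then call
read_trace() once the stall has been reported. That only works if the
system stays responsive enough to read the buffer back.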

cheers