Date:	Sun, 27 Mar 2016 13:48:55 +0000 (UTC)
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	"Chatre, Reinette" <reinette.chatre@...el.com>,
	Jacob Pan <jacob.jun.pan@...ux.intel.com>,
	Josh Triplett <josh@...htriplett.org>,
	Ross Green <rgkernel@...il.com>,
	John Stultz <john.stultz@...aro.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	lkml <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Lai Jiangshan <jiangshanlai@...il.com>, dipankar@...ibm.com,
	Andrew Morton <akpm@...ux-foundation.org>,
	rostedt <rostedt@...dmis.org>,
	David Howells <dhowells@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Darren Hart <dvhart@...ux.intel.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	pranith kumar <bobby.prani@...il.com>
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

----- On Mar 26, 2016, at 9:34 PM, Paul E. McKenney paulmck@...ux.vnet.ibm.com wrote:

> On Sat, Mar 26, 2016 at 10:22:57PM +0000, Mathieu Desnoyers wrote:
>> ----- On Mar 26, 2016, at 2:49 PM, Paul E. McKenney paulmck@...ux.vnet.ibm.com
>> wrote:
>> 
>> > On Sat, Mar 26, 2016 at 08:28:16AM -0700, Paul E. McKenney wrote:
>> >> On Sat, Mar 26, 2016 at 12:29:31PM +0000, Mathieu Desnoyers wrote:
>> >> > ----- On Mar 25, 2016, at 5:46 PM, Paul E. McKenney paulmck@...ux.vnet.ibm.com
>> >> > wrote:
>> >> > 
>> >> > > On Fri, Mar 25, 2016 at 09:24:14PM +0000, Chatre, Reinette wrote:
>> >> > >> Hi  Paul,
>> >> > >> 
>> >> > >> On 2016-03-23, Paul E. McKenney wrote:
>> >> > >> > Please boot with the following parameters:
>> >> > >> > 
>> >> > >> > 	rcutree.rcu_kick_kthreads ftrace
>> >> > >> > trace_event=sched_waking,sched_wakeup,sched_wake_idle_without_ipi
>> >> > >> 
>> >> > >> With these parameters I expected more details to show up in the kernel logs but
>> >> > >> cannot find any. Even so, today I left the machine running again and when this
>> >> > >> happened I think I was able to capture the trace data for the event. Please
>> >> > >> find attached the trace information for the kernel message below. Since the
>> >> > >> complete trace file is very big I trimmed it to show the time around this event
>> >> > >> - hopefully this will contain the information you need. I would also like to
>> >> > >> provide some additional information. The system on which I see these events had
>> >> > >> a time that was _very_ wrong. I noticed that this issue occurs when
>> >> > >> system-timesynd was one of the tasks calling the functions of interest to your
>> >> > >> tracing, and I am wondering whether a very out-of-sync time in the process
>> >> > >> of being corrected could be the cause of this issue. As an experiment I ensured the
>> >> > >> system time was accurate before leaving the system idle overnight and I did not
>> >> > >> see the issue the next morning.
>> >> > > 
>> >> > > Ah!  Yes, a sudden jump in time or a disagreement about the time among
>> >> > > different components of the system can definitely cause these symptoms.
>> >> > > We have sometimes seen these problems occur when a pair of CPUs have
>> >> > > wildly different ideas about what time it is, for example.  Please let
>> >> > > me know how it goes.
>> >> > > 
>> >> > > Also, in your trace, there are no sched_waking events for the rcu_preempt
>> >> > > process that are not immediately followed by sched_wakeup, so your trace
>> >> > > isn't showing the problem that I am seeing.
>> >> > 
>> >> > This is interesting.
>> >> > 
>> >> > Perhaps we could try with those commits reverted ?
>> >> > 
>> >> > commit e3baac47f0e82c4be632f4f97215bb93bf16b342
>> >> > Author: Peter Zijlstra <peterz@...radead.org>
>> >> > Date:   Wed Jun 4 10:31:18 2014 -0700
>> >> > 
>> >> >     sched/idle: Optimize try-to-wake-up IPI
>> >> > 
>> >> > commit fd99f91aa007ba255aac44fe6cf21c1db398243a
>> >> > Author: Peter Zijlstra <peterz@...radead.org>
>> >> > Date:   Wed Apr 9 15:35:08 2014 +0200
>> >> > 
>> >> >     sched/idle: Avoid spurious wakeup IPIs
>> >> > 
>> >> > They appeared in 3.16.
>> >> 
>> >> At this point, I am up for trying pretty much anything.  ;-)
>> >> 
>> >> Will give it a go.
>> > 
>> > And those certainly don't revert cleanly!  Would patching the kernel
>> > to remove the definition of TIF_POLLING_NRFLAG be useful?  Or, more
>> > to the point, is there some other course of action that would be more
>> > useful?  At this point, the test times are measured in weeks...
>> 
>> Indeed, patching the kernel to remove the TIF_POLLING_NRFLAG
>> definition would have an effect similar to reverting those two
>> commits.
>> 
>> Since testing takes a while, we could take a more aggressive
>> approach towards reproducing a possible race condition: we
>> could re-implement the _TIF_POLLING_NRFLAG vs _TIF_NEED_RESCHED
>> dance, along with the ttwu pending lock-list queue, within
>> a dummy test module, with custom data structures, and
>> stress-test the invariants. We could also create a Promela
>> model of these ipi-skip optimisations trying to validate
>> progress: whenever a wakeup is requested, there should
>> always be a scheduling performed, even if no further wakeup
>> is encountered.
>> 
>> Each of the two approaches proposed above might be a significant
>> endeavor, and would only validate my specific hunch. So it might
>> be a good idea to just let a test run for a few weeks with
>> TIF_POLLING_NRFLAG disabled meanwhile.
> 
> This makes a lot of sense.  I did some short runs, and nothing broke
> too badly.  However, I left some diagnostic stuff in that obscured
> the outcome.  I disabled the diagnostic stuff and am running overnight.
> I might need to go further and revert some of my diagnostic patches,
> but let's see where it is in the morning.

Here is another idea that might help us reproduce this issue faster.
If you can afford it, you might want to just throw more similar hardware
at the problem. Assuming the problem shows up at random, but only about
once per week per machine, then with 100 machines idling in the same way
in parallel we should be able to reproduce it within about 1-2 hours.

Of course, if the problem really needs each machine to "degrade" for
a week (e.g. memory fragmentation), that would not help. This approach
only works for races that show up at random.

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
