linux-kernel - Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANfgCY0sy7Fib6yihxFrj+iZx7Y8JkBKMcw2Fjc7JX_58MuFiQ@mail.gmail.com>
Date:	Thu, 18 Feb 2016 22:51:29 +1100
From:	Ross Green <rgkernel@...il.com>
To:	Paul McKenney <paulmck@...ux.vnet.ibm.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel@...r.kernel.org, mingo@...nel.org,
	jiangshanlai@...il.com, dipankar@...ibm.com,
	akpm@...ux-foundation.org,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	josh@...htriplett.org, tglx@...utronix.de, rostedt@...dmis.org,
	dhowells@...hat.com, Eric Dumazet <edumazet@...gle.com>,
	dvhart@...ux.intel.com,
	Frédéric Weisbecker <fweisbec@...il.com>,
	oleg@...hat.com, pranith kumar <bobby.prani@...il.com>
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

On Thu, Feb 18, 2016 at 10:19 AM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
> On Wed, Feb 17, 2016 at 12:28:29PM -0800, Paul E. McKenney wrote:
>> On Wed, Feb 17, 2016 at 08:45:54PM +0100, Peter Zijlstra wrote:
>> > On Wed, Feb 17, 2016 at 11:28:17AM -0800, Paul E. McKenney wrote:
>> > > On Tue, Feb 16, 2016 at 09:45:49PM -0800, Paul E. McKenney wrote:
>> > > > On Tue, Feb 09, 2016 at 09:11:55PM +1100, Ross Green wrote:
>> > > > > Continued testing with the latest linux-4.5-rc3 release.
>> > > > >
>> > > > > Please find attached a copy of traces from dmesg:
>> > > > >
>> > > > > There is a lot more debug and trace data so hopefully this will shed
>> > > > > some light on what might be happening here.
>> > > > >
>> > > > > My testing remains run a series of simple benchmarks, let that run to
>> > > > > completion and then leave the system idle away with just a few daemons
>> > > > > running.
>> > > > >
>> > > > > the self detected stalls in this instance turned up after a days run time.
>> > > > > There were  NO heavy artificial computational loads on the machine.
>> > > >
>> > > > It does indeed look quiet on that dmesg for a good long time.
>> > > >
>> > > > The following insanely crude not-for-mainline hack -might- be producing
>> > > > good results in my testing.  It will take some time before I can claim
>> > > > statistically different results.  But please feel free to give it a go
>> > > > in the meantime.  (Thanks to Al Viro for pointing me in this direction.)
>> >
>> > Your case was special in that is was hotplug triggering it, right?
>>
>> Yes, it has thus far only shown up with CPU hotplug enabled.
>>
>> > I was auditing the hotplug paths involved when I fell ill two weeks ago,
>> > and have not really made any progress on that because of that :/
>>
>> I have always said that being sick is bad for one's health, but I didn't
>> realize that it could be bad for the kernel's health as well.  ;-)
>>
>> > I'll go have another look, I had a vague feeling for a race back then,
>> > lets see if I can still remember how..
>>
>> I believe that I can -finally- get an ftrace_dump() to happen within
>> 10-20 milliseconds of the problem, which just might be soon enough
>> after the problem to gather some useful information.  I am currently
>> testing this theory with "ftrace trace_event=sched_waking,sched_wakeup"
>> boot arguments on a two-hour run.
>
> And apparently another way to greatly reduce the probability of this
> bug occurring is to enable ftrace.  :-/
>
> Will try longer runs.
>
>                                                         Thanx, Paul
>
>> If this works out, what would be a useful set of trace events for me
>> to capture?
>>
>>                                                       Thanx, Paul
>

Well managed to catch this one on linux-4.5-rc4.

Took over 3 days and 7 hours to hit.

Same test as before, boot, run a series of simple benchmarks and then
let the machine just idle away.

As before, the reported stall, AND everything keeps on running as if
nothing had happened.

I notice in the task dump for both the cpus, the swapper is running on
both cpus.

does that make any sense?
There is around 3% of memory actually used.

Anyway, please find attached a copy of the dmesg output.

Hope this helps a few people fill in the missing pieces here.

Regards,

Ross Green

Download attachment "dmesg-4.5-rc4" of type "application/octet-stream" (30353 bytes)