linux-kernel - Re: WARN_ON_ONCE() in process_one

Message-Id: <20170621153035.GA31181@linux.vnet.ibm.com>
Date:   Wed, 21 Jun 2017 08:30:35 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     jiangshanlai@...il.com, linux-kernel@...r.kernel.org
Subject: Re: WARN_ON_ONCE() in process_one_work()?

On Tue, Jun 20, 2017 at 09:45:23AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 18, 2017 at 06:40:00AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> > > On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > > 
> > > > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > > > And no test failures from yesterday evening.  So it looks like we get
> > > > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > > > runtime with your printk() in the mix.
> > > > >
> > > > > Was the above output from your printk() output of any help?
> > > > 
> > > > Yeah, if my suspicion is correct, it'd require new kworker creation
> > > > racing against CPU offline, which would explain why it's so difficult
> > > > to repro.  Can you please see whether the following patch resolves the
> > > > issue?
> > > 
> > > That could explain why only Steve Rostedt and I saw the issue.  As far
> > > as I know, we are the only ones who regularly run CPU-hotplug stress
> > > tests.  ;-)
> > 
> > I was a bit confused.  It has to be racing against either new kworker
> > being created on the wrong CPU or rescuer trying to migrate to the
> > CPU, and it looks like we're mostly seeing the rescuer condition, but,
> > yeah, this would only get triggered rarely.  Another contributing
> > factor could be the vmstat work being put on a workqueue w/ rescuer
> > recently.  It runs quite often, so probably has increased the chance
> > of hitting the right condition.
> 
> Sounds like too much fun!  ;-)
> 
> But more constructively...  If I understand correctly, it is now possible
> to take a CPU partially offline and put it back online again.  This should
> allow much more intense testing of this sort of interaction.
> 
> And no, I haven't yet tried this with RCU because I would probably need
> to do some mix of just-RCU online/offline and full-up online-offline.
> Plus RCU requires pretty much a full online/offline cycle to fully
> exercise it.  :-/
> 
> > > I have a weekend-long run going, but will give this a shot overnight on
> > > Monday, Pacific Time.  Thank you for putting it together, looking forward
> > > to seeing what it does!
> > 
> > Thanks a lot for the testing and patience.  Sorry that it took so
> > long.  I'm not completely sure the patch is correct.  It might have to
> > be more specific about which type of migration or require further
> > synchronization around migration, but hopefully it'll at least be able
> > to show that this was the cause of the problem.
> 
> And last night's tests had no failures.  Which might actually mean
> something, will get more info when I run without your patch this
> evening.  ;-)

And it didn't fail without the patch, either.  45 hours of test vs.
60 hours with the patch.  This one is not going to be easy to prove
either way.  I will try again this evening without the patch and see
what that gets us.

							Thanx, Paul
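
A rough back-of-the-envelope check of why runs of 45 and 60 hours are
inconclusive. This is only a sketch: it assumes failures arrive as a
Poisson process at the rate quoted earlier in the thread (about one
failure per 138 hours of TREE07 rcutorture, measured with the printk()
in place), and that this rate also applies to the unpatched kernel.
Neither assumption comes from the thread itself, and the function names
are made up for illustration.

```python
import math

# Rate quoted in the thread: roughly one failure per 138 hours of
# TREE07 rcutorture runtime (measured with the debugging printk()).
MEAN_HOURS_BETWEEN_FAILURES = 138.0


def p_no_failure(hours, mtbf=MEAN_HOURS_BETWEEN_FAILURES):
    """Probability of seeing zero failures in `hours` of testing,
    assuming (hypothetically) Poisson arrivals with mean spacing `mtbf`."""
    return math.exp(-hours / mtbf)


def hours_for_confidence(confidence, mtbf=MEAN_HOURS_BETWEEN_FAILURES):
    """Failure-free test hours needed before a clean run would be
    surprising at the given confidence level if the bug were still there."""
    return -math.log(1.0 - confidence) * mtbf


if __name__ == "__main__":
    for hours in (45, 60):
        print(f"{hours} h clean run, bug still present: "
              f"p = {p_no_failure(hours):.2f}")
    print(f"hours needed for 95% confidence: "
          f"{hours_for_confidence(0.95):.0f}")
```

Under those assumptions an unfixed kernel would survive 45 hours about
72% of the time and 60 hours about 65% of the time, so neither clean run
says much. Something on the order of 400 failure-free hours would be
needed before a clean run becomes reasonably convincing.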