Message-ID: <20170618104000.GC28042@htj.duckdns.org>
Date:   Sun, 18 Jun 2017 06:40:00 -0400
From:   Tejun Heo <tj@...nel.org>
To:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:     jiangshanlai@...il.com, linux-kernel@...r.kernel.org
Subject: Re: WARN_ON_ONCE() in process_one_work()?
Hello,
On Sat, Jun 17, 2017 at 10:31:05AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 17, 2017 at 07:53:14AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Jun 16, 2017 at 10:36:58AM -0700, Paul E. McKenney wrote:
> > > And no test failures from yesterday evening.  So it looks like we get
> > > somewhere on the order of one failure per 138 hours of TREE07 rcutorture
> > > runtime with your printk() in the mix.
> > >
> > > Was the above output from your printk() output of any help?
> > 
> > Yeah, if my suspicion is correct, it'd require new kworker creation
> > racing against CPU offline, which would explain why it's so difficult
> > to repro.  Can you please see whether the following patch resolves the
> > issue?
> 
> That could explain why only Steve Rostedt and I saw the issue.  As far
> as I know, we are the only ones who regularly run CPU-hotplug stress
> tests.  ;-)
I was a bit confused.  It has to be racing against either a new kworker
being created on the wrong CPU or the rescuer trying to migrate to the
CPU, and it looks like we're mostly seeing the rescuer condition, but,
yeah, this would only get triggered rarely.  Another contributing
factor could be the vmstat work recently being put on a workqueue w/ a
rescuer.  It runs quite often, so that has probably increased the chance
of hitting the right condition.
> I have a weekend-long run going, but will give this a shot overnight on
> Monday, Pacific Time.  Thank you for putting it together, looking forward
> to seeing what it does!
Thanks a lot for the testing and patience.  Sorry that it took so
long.  I'm not completely sure the patch is correct.  It might have to
be more specific about which type of migration or require further
synchronization around migration, but hopefully it'll at least be able
to show that this was the cause of the problem.
Thanks!
-- 
tejun