Message-ID: <CAEXW_YTcL-nOfJXkChGhvQtqqfSLpAYr327PLu1SmGEEADCevw@mail.gmail.com>
Date: Thu, 18 Jul 2019 12:14:22 -0400
From: Joel Fernandes <joel@...lfernandes.org>
To: "Paul E. McKenney" <paulmck@...ux.ibm.com>
Cc: Byungchul Park <max.byungchul.park@...il.com>,
Byungchul Park <byungchul.park@....com>,
rcu <rcu@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
kernel-team@....com
Subject: Re: [PATCH] rcu: Make jiffies_till_sched_qs writable
Trimming the list a bit to keep my noise level low,
On Sat, Jul 13, 2019 at 1:41 PM Paul E. McKenney <paulmck@...ux.ibm.com> wrote:
[snip]
> > > It still feels like you guys are hyperfocusing on this one particular
> > > knob. I instead need you to look at the interrelating knobs as a group.
> >
> > Thanks for the hints, we'll do that.
> >
> > > On the debugging side, suppose someone gives you an RCU bug report.
> > > What information will you need? How can you best get that information
> > > without excessive numbers of over-and-back interactions with the guy
> > > reporting the bug? As part of this last question, what information is
> > > normally supplied with the bug? Alternatively, what information are
> > > bug reporters normally expected to provide when asked?
> >
> > I suppose I could dig out some of our Android bug reports of the past where
> > there were RCU issues but if there's any fires you are currently fighting do
> > send it our way as debugging homework ;-)
>
> Suppose that you were getting RCU CPU stall
> warnings featuring multi_cpu_stop() called from cpu_stopper_thread().
> Of course, this really means that some other CPU/task is holding up
> multi_cpu_stop() without also blocking the current grace period.
>
So I took a shot at this, trying to learn how CPU stoppers work in
relation to this problem.
I am assuming here that, say, CPU X has entered the
MULTI_STOP_DISABLE_IRQ state in multi_cpu_stop() but another CPU Y has
not yet entered this state. So CPU X is stalling RCU, but it is really
because of CPU Y. Now in the problem statement, you mentioned CPU Y is
not holding up the grace period, which means Y doesn't have any of
IRQ, BH or preemption disabled, but is still somehow stalling RCU
indirectly by troubling X.
This can only happen if:
- CPU Y has a thread executing on it that is higher priority than CPU
X's stopper thread, which prevents Y's stopper from getting scheduled.
But the CPU stopper thread (migration/..) is the highest-priority RT
thread, so this would be some kind of odd scheduler bug.
- There is a bug in the CPU stopper machinery itself preventing it
from scheduling the stopper on Y, even though Y is not holding up the
grace period.
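To make the X/Y scenario above concrete, here is a rough userspace toy
model, not the kernel code. Only the state names are taken from
kernel/stop_machine.c; the toy_* types and functions and the y_rounds
knob are made up for illustration, and the real code uses atomics and
one stopper thread per CPU rather than this sequential loop. The idea
it tries to show: the last CPU to ack a state advances everyone, so if
Y's stopper stops running after acking PREPARE, X sits in
MULTI_STOP_DISABLE_IRQ with IRQs off and never reaches RUN.

```c
#include <assert.h>
#include <stdbool.h>

/* State names mirror kernel/stop_machine.c; the rest is a toy model. */
enum multi_stop_state {
	MULTI_STOP_NONE,
	MULTI_STOP_PREPARE,
	MULTI_STOP_DISABLE_IRQ,
	MULTI_STOP_RUN,
	MULTI_STOP_EXIT,
};

#define NR_CPUS 2

struct toy_msdata {
	enum multi_stop_state state;	/* shared state every stopper polls */
	int thread_ack;			/* acks still missing for this state */
};

struct toy_cpu {
	enum multi_stop_state curstate;	/* last state this CPU acted on */
	bool irqs_off;
};

struct toy_result {
	enum multi_stop_state state;	/* shared state when the run ended */
	bool x_irqs_off;		/* does X still have IRQs disabled? */
};

static void toy_set_state(struct toy_msdata *m, enum multi_stop_state s)
{
	m->thread_ack = NR_CPUS;
	m->state = s;
}

/* One pass through the stopper's polling loop on one CPU. */
static void toy_stopper_poll(struct toy_msdata *m, struct toy_cpu *c)
{
	if (m->state == c->curstate)
		return;			/* nothing new, keep spinning */
	c->curstate = m->state;
	if (c->curstate == MULTI_STOP_DISABLE_IRQ)
		c->irqs_off = true;
	else if (c->curstate == MULTI_STOP_EXIT)
		c->irqs_off = false;	/* IRQs come back on the way out */
	/* The last CPU to ack moves everyone on to the next state. */
	assert(m->thread_ack > 0);
	if (--m->thread_ack == 0 && m->state != MULTI_STOP_EXIT)
		toy_set_state(m, (enum multi_stop_state)(m->state + 1));
}

/*
 * CPU 0 plays "X" and polls every round. CPU 1 plays "Y" and only
 * polls during the first y_rounds rounds (hypothetical knob standing
 * in for "something keeps Y's stopper off the CPU").
 */
static struct toy_result toy_run(int rounds, int y_rounds)
{
	struct toy_msdata m;
	struct toy_cpu cpu[NR_CPUS] = {
		{ MULTI_STOP_NONE, false },
		{ MULTI_STOP_NONE, false },
	};
	struct toy_result res;
	int r;

	toy_set_state(&m, MULTI_STOP_PREPARE);
	for (r = 0; r < rounds; r++) {
		toy_stopper_poll(&m, &cpu[0]);
		if (r < y_rounds)
			toy_stopper_poll(&m, &cpu[1]);
	}
	res.state = m.state;
	res.x_irqs_off = cpu[0].irqs_off;
	return res;
}
```

In this model, toy_run(10, 10) ends in MULTI_STOP_EXIT with X's IRQs
back on, while toy_run(10, 1) leaves the shared state stuck at
MULTI_STOP_DISABLE_IRQ with X's IRQs still off, which is the situation
I am assuming above.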
Did I get that right? Would be exciting to run the rcutorture test
once Paul has it available to reproduce this problem.
thanks,
- Joel