Message-ID: <20110805231216.GE22164@somewhere>
Date: Sat, 6 Aug 2011 01:12:18 +0200
From: Frederic Weisbecker <fweisbec@...il.com>
To: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc: Josh Boyer <jwboyer@...hat.com>, linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: 3.0-git15 Atomic scheduling in pidmap_init
On Fri, Aug 05, 2011 at 03:26:41PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 05, 2011 at 07:08:08PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 05, 2011 at 10:22:45AM -0400, Josh Boyer wrote:
> > > On Thu, Aug 04, 2011 at 11:56:46PM -0700, Paul E. McKenney wrote:
> > > > > > I could be missing something obvious, but I don't see a way to avoid
> > > > > > using GFP_KERNEL without a lot of rip-up in the rest of the init path.
> > > > >
> > > > > As an aside, I bisected this back to:
> > > > >
> > > > > e8f7c70f44f sched: Make sleeping inside spinlock detection working in
> > > > > !CONFIG_PREEMPT
> > > >
> > > > OK, added Frederic on CC.
> > > >
> > > > > However, that doesn't seem all that helpful. The
> > > > > CONFIG_DEBUG_SPINLOCK_SLEEP option later got renamed to
> > > > > DEBUG_ATOMIC_SLEEP, and all it's doing is selecting PREEMPT_COUNT. At
> > > > > first glance, it seems this commit just allowed an issue that's been
> > > > > around for a while (benign or otherwise) to finally show up.
> > > > >
> > > > > (The Fedora kernel configs have CONFIG_PREEMPT_VOLUNTARY set, but not
> > > > > CONFIG_PREEMPT, so PREEMPT_COUNT wasn't getting selected until this
> > > > > option did so.)
> > > >
> > > > Understood. So my question is "what is the real way to fix this?"
> > > > Within RCU, I would probably wrapper the calls to set_need_resched()
> > > > so that it checks for the scheduler being fully alive. Except for the
> > > > call from rcu_enter_nohz(), of course -- if that one is called before
> > > > the scheduler is ready, then that is a bug that needs to be fixed.
> > >
> > > By "scheduler being fully alive", do you mean when rcu_scheduler_starting
> > > is called? Or do you mean the actual scheduler, because sched_init is
> > > called well before any of this happens?
> > >
> > > > Nevertheless, I am wondering if all of this isn't really papering over
> > > > some real problem somewhere. The way we get to this place is from people
> > > > registering RCU callbacks during early boot, which is OK in and of itself,
> > > > at least in moderation. But if someone is expecting those callbacks to be
> > > > invoked before the scheduler is fully set up and running multiple tasks,
> > > > they are going to be disappointed.
> > >
> > > Is there a way to dump what callbacks have been registered? As far as I
> > > can see, we call rcu_check_callbacks unconditionally when a timer
> > > interrupt is taken and that calls rcu_pending unconditionally as well.
> > > Before that, rcu_init is called which eventually sets up the per-cpu
> > > data via rcu_init_percpu_data and that sets rdp->qs_pending = 1.
> > > Until a quiescent state is reached, __rcu_pending is going to try to
> > > force one, which is where set_need_resched() is called.
> > >
> > > Basically, I took what you said about wrapping set_need_resched and came
> > > up with the patch below. It gets rid of the oops from pidmap_init, but
> > > I need to test it a bit more. Would be happy to have feedback.
> > >
> > > josh
> > >
> > > ---
> > >
> > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > > index ba06207..8c6cb6e 100644
> > > --- a/kernel/rcutree.c
> > > +++ b/kernel/rcutree.c
> > > @@ -1681,8 +1681,14 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> > >  		rdp->n_rp_qs_pending++;
> > >  		if (!rdp->preemptible &&
> > >  		    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs) - 1,
> > > -				 jiffies))
> > > -			set_need_resched();
> > > +				 jiffies)) {
> > > +			/* Make sure we're ready to mark the task as needing
> > > +			 * rescheduling, otherwise we can trigger oopses early
> > > +			 * in the init path.
> > > +			 */
> > > +			if (rcu_scheduler_active)
> > > +				set_need_resched();
> >
> > What if we avoid setting rdp->qs_pending = 1 for the CPU
> > that handles the boot?
>
> That sounds promising -- only checked at the beginning of a grace period,
> so not too much overhead. In contrast, __rcu_pending() is called
> multiple times per transition to dyntick-idle state.
>
> I will take a look at this.
>
> Thanx, Paul
I actually thought it could be done from rcu_init_percpu_data(). That
is where we initialize qs_pending to 1, which I believe is what triggers
that set_need_resched() from RCU soon after boot.

It's possible we also have secondary offenders in places that enqueue
RCU callbacks during boot. But if not, then we are fine with just
tweaking that qs_pending when the CPU is brought up.

The initial qs_pending = 1 in rcu_init_percpu_data() is useful only if
a grace period has already started, and by the time rcu_init() is called
I guess that can't yet be the case.
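
Something like the following completely untested sketch is what I have
in mind (just to illustrate the idea; it assumes rcu_scheduler_active is
still 0 by the time rcu_init() runs, and 1 for CPUs brought online later):

--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ rcu_init_percpu_data() @@
 	rdp->passed_quiesc = 0;
-	rdp->qs_pending = 1;
+	/*
+	 * Don't ask for a quiescent state before the scheduler is up,
+	 * otherwise __rcu_pending() ends up calling set_need_resched()
+	 * way too early in boot. Later CPU-hotplug initialization still
+	 * gets qs_pending = 1 as before.
+	 */
+	rdp->qs_pending = !!rcu_scheduler_active;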
>
> > > +		}
> > >  	} else if (rdp->qs_pending && rdp->passed_quiesc) {
> > >  		rdp->n_rp_report_qs++;
> > >  		return 1;