Message-Id: <1259311550.6110.241.camel@marge.simson.net>
Date: Fri, 27 Nov 2009 09:45:50 +0100
From: Mike Galbraith <efault@....de>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...e.hu>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [patch] sched: fix b5d9d734 blunder in task_new_fair()
Off list
On Thu, 2009-11-26 at 17:26 +0100, Peter Zijlstra wrote:
> On Sun, 2009-11-22 at 13:07 +0100, Mike Galbraith wrote:
> > @@ -2589,16 +2588,10 @@ void sched_fork(struct task_struct *p, i
> > */
> > p->prio = current->normal_prio;
> >
> > - if (!rt_prio(p->prio))
> > + if (!task_has_rt_policy(p))
> > p->sched_class = &fair_sched_class;
> >
> > -#ifdef CONFIG_SMP
> > - cpu = p->sched_class->select_task_rq(p, SD_BALANCE_FORK, 0);
> > -#endif
> > - local_irq_save(flags);
> > - update_rq_clock(cpu_rq(cpu));
> > - set_task_cpu(p, cpu);
> > - local_irq_restore(flags);
> > + __set_task_cpu(p, cpu);
>
> OK, so I figured out why it was in sched_fork() and not in
> wake_up_new_task().
>
> It is because in sched_fork() the new task isn't in the tasklist yet and
> can therefore not race with any other migration logic.
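
Right, the ordering in copy_process() is what makes that safe.  Roughly
(a sketch from memory, details elided, not the exact code):

	/* kernel/fork.c, copy_process() -- ordering only */
	p = dup_task_struct(current);
	...
	sched_fork(p, clone_flags);	/* p not reachable by anyone yet */
	...
	/* only here does p become visible to the rest of the world */
	list_add_tail_rcu(&p->tasks, &init_task.tasks);
	attach_pid(p, PIDTYPE_PID, pid);
	...
	/* later, from do_fork(), after copy_process() returns */
	wake_up_new_task(p, clone_flags);	/* p IS on the tasklist here */
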
All the raciness I'm fretting over probably just doesn't matter much.
Things aren't exploding.  Maybe min_vruntime is the only thing I should
be worrying about.  The concern there is in-flight deltas of SCHED_IDLE
magnitude, i.e. cross-cpu "fuzziness" on a very large scale.
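
To put a shape on that concern (paraphrasing the relevant bit of
set_task_cpu() from memory, not the exact lines): when a task changes
runqueues, its vruntime is only kept sane by re-basing it against the
two queues' min_vruntime:

	/* set_task_cpu(), roughly -- re-base vruntime onto the new queue */
	struct cfs_rq *old_cfsrq = task_cfs_rq(p);
	struct cfs_rq *new_cfsrq = cpu_cfs_rq(old_cfsrq, new_cpu);

	p->se.vruntime -= old_cfsrq->min_vruntime - new_cfsrq->min_vruntime;

With SCHED_IDLE weights the per-tick vruntime deltas are enormous, so if
either min_vruntime moves while the task is in flight, the placement
error can be correspondingly enormous.
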
However :-/ (aw sh*t, here I go again. aaaaOOOOOgah! dive! dive!;)
WRT the affinity, sched_class and nice level fretting: all of that can
change from userland at any instant where you do not hold the task's
runqueue lock and nobody holds the tasklist lock to keep userland from
taking a task reference and starting the ball rolling.  As soon as you
drop the runqueue lock, userland can acquire it and change whatever it
likes under you, so afaict we can call the wrong sched_class method etc
etc.
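
Concretely, the pattern I'm worried about is anything of this shape
(hypothetical caller, made up purely for illustration, not real code):

	const struct sched_class *class;

	spin_lock(&rq->lock);
	class = p->sched_class;			/* sample it... */
	spin_unlock(&rq->lock);

	/*
	 * window: sched_setscheduler()/sched_setaffinity() from userland
	 * can run here and switch p to another class, change affinity, etc.
	 */

	class->select_task_rq(p, SD_BALANCE_WAKE, 0);	/* ...stale method */
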
3f029d3 agrees fully wrt sched_class at least:
    In addition, we fix a race condition where we try to access
    current->sched_class without holding the rq->lock.  This is
    technically racy, as the sched-class could change out from
    under us.  Instead, we reference the per-rq post_schedule
    variable with the runqueue unlocked, but with preemption
    disabled to see if we need to reacquire the rq->lock.
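
IIRC the shape of what 3f029d3 ended up with is roughly this (from
memory, may not match the tree exactly):

	static inline void post_schedule(struct rq *rq)
	{
		if (rq->post_schedule) {
			unsigned long flags;

			spin_lock_irqsave(&rq->lock, flags);
			if (rq->curr->sched_class->post_schedule)
				rq->curr->sched_class->post_schedule(rq);
			spin_unlock_irqrestore(&rq->lock, flags);

			rq->post_schedule = 0;
		}
	}

i.e. the class pointer is only dereferenced with rq->lock re-acquired;
the unlocked test is confined to the per-rq flag.
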
The only thing that really changes with the unlocked _rummaging_ is that
we now can't count on the nr_running/load of the task's current runqueue,
its sched_class, etc. while we're rummaging: ALL state is fuzzy instead
of only most of it.
However, I don't think we can even count on the task remaining untouched
while in TASK_WAKING state, and that might be a bigger deal.
afaict, userland can migrate the task you're in the process of waking
while you're off rummaging around looking for a place to put it, like
so: the wakee is on the tasklist, so it can be accessed by userland (we
wouldn't be in ttwu either were it not).  We're waking it, so we set the
task state to TASK_WAKING and release the lock.  Userland acquires it;
nobody but ttwu has ever heard of TASK_WAKING, so it happily changes the
task's affinity, migrates the sleeping task to the one and only correct
runqueue (pins it), sets task cpu etc, releases the lock, and goes home.
We finish rummaging, do NOT check for an intervening affinity change,
and instead do an unlocked scribble over what userland just wrote,
resetting cpu and vruntime to a now-illegal cpu, and activate.  I'm not
seeing any inhibitor for this scenario.
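
In code form, the window looks roughly like this (ttwu trimmed way down,
from memory, so don't quote me on the exact lines):

	p->state = TASK_WAKING;
	task_rq_unlock(rq, &flags);
	/*
	 * <-- sched_setaffinity() can run right here: migrate_task() sees
	 *     a sleeping task, so it just updates p->cpus_allowed and calls
	 *     set_task_cpu(p, dest_cpu) for the one remaining legal cpu,
	 *     drops the lock, done.
	 */
	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
	if (cpu != orig_cpu)
		set_task_cpu(p, cpu);	/* unlocked scribble over userland's
					 * choice: cpu and vruntime now point
					 * at a cpu we may not be allowed on */
	rq = task_rq_lock(p, &flags);
	...
	activate_task(rq, p, 1);	/* and we activate it there anyway */
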
When I moved fork balancing runqueue selection to the much more logical
wakeup and enqueue time, vs copy and fiddle time, I didn't fix anything,
I merely duplicated the races that are now in ttwu.
No matter where we do the selection, we can race with userland if the
darn thing isn't locked all the while.  With .31 ttwu locking, there is
no race, because nobody can get to the task struct.  If the target cpu
changes via rq selection, we set the cpu, _then_ unlock, at which point
userland or whoever may override _our_ decision, but we never write
after re-acquiring, so intervening changes, if any, stay intact.
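
IIRC the .31 sequence was roughly (again trimmed from memory):

	rq = task_rq_lock(p, &flags);		/* held across selection */
	...
	cpu = p->sched_class->select_task_rq(p, sync);
	if (cpu != orig_cpu) {
		set_task_cpu(p, cpu);		/* write while still locked */
		task_rq_unlock(rq, &flags);
		/* userland may override our choice here, which is fine... */
		rq = task_rq_lock(p, &flags);
		cpu = task_cpu(p);		/* ...because after re-acquiring
						 * we only re-read, never
						 * re-write */
	}
	...
	activate_task(rq, p, 1);
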
During an exec, a userland policy/affinity change will deactivate/activate
the task or do a migration call.  We don't have the thing locked while
we're rummaging, so what keeps sched_class from changing after we've
evaluated it, such that we call the wrong method and then do our own
migration call on top?
/me is still pretty befuddled, and hasn't even crawled over PI.
I flat out don't see how we can do this race-free, unless we put every
task into some untouchable state while we're rummaging, and teach
everything having to do with migration about that state.
-Mike