These are some notes on the scheduler locking and how it provides program
order guarantees on SMP systems.

Cc: Linus Torvalds
Cc: Will Deacon
Cc: Oleg Nesterov
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Jonathan Corbet
Cc: Michal Hocko
Cc: David Howells
Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/sched/core.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1905,6 +1905,148 @@ static void ttwu_queue(struct task_struc
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on Program-Order guarantees on SMP systems.
+ *
+ *
+ * PREEMPTION/MIGRATION
+ *
+ * Regular preemption/migration is safe because as long as the task is runnable
+ * migrations involve both rq locks, albeit not (necessarily) at the same time.
+ *
+ * So we get (we allow 3 CPU migrations):
+ *
+ *   CPU0            CPU1                        CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                               LOCK rq(0)->lock // MB against CPU0
+ *                                               dequeue X
+ *                                               UNLOCK rq(0)->lock
+ *
+ *                                               LOCK rq(1)->lock
+ *                                               enqueue X
+ *                                               UNLOCK rq(1)->lock
+ *
+ *                   LOCK rq(1)->lock // MB against CPU2
+ *                   sched-out Z
+ *                   sched-in X
+ *                   UNLOCK rq(1)->lock
+ *
+ * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
+ * on CPU0.  Similarly the LOCK rq(1) on CPU1 provides full order against the
+ * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
+ * observe the state it left behind on CPU0.
+ *
+ *
+ * BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking things are a little more interesting, because when we dequeue
+ * the task, we don't need to acquire the old rq lock in order to migrate it.
+ *
+ * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
+ * to CPU2 (the most complex example):
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up)      CPU2 (sched_ttwu_pending)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *   LOCK rq(0)->lock LOCK X->pi_lock
+ *
+ *   dequeue X
+ *                    while (X->on_cpu)
+ *                      cpu_relax()
+ *   sched-out X
+ *   RELEASE
+ *   X->on_cpu = 0
+ *                    RMB
+ *                    X->state = WAKING
+ *                    set_task_cpu(X,2)
+ *                    WMB
+ *                    ti(X)->cpu = 2
+ *
+ *                    llist_add(X, rq(2)) // MB
+ *                                               llist_del_all() // MB
+ *
+ *                                               LOCK rq(2)->lock
+ *                                               enqueue X
+ *                                               X->state = RUNNING
+ *                                               UNLOCK rq(2)->lock
+ *
+ *                                               LOCK rq(2)->lock
+ *                                               sched-out Z
+ *                                               sched-in X
+ *                                               UNLOCK rq(2)->lock
+ *
+ *                                               if (cond) // _TRUE_
+ *                    UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * the smp_store_release() in finish_lock_switch(), paired with the control-dep
+ * and smp_rmb() in try_to_wake_up(), forms a release-acquire pair and fully
+ * orders things between CPU0 and CPU1.
+ *
+ * The llist primitives order things between CPU1 and CPU2 -- the alternative
+ * is CPU1 doing the remote enqueue itself (the other path in ttwu_queue()), in
+ * which case the rq(2)->lock release/acquire will order things between them.
+ *
+ * This again leads to the guarantee that by the time X gets to run on CPU2
+ * it must observe the state it left behind on CPU0.
+ *
+ * However, for blocking there is a second guarantee we must provide, namely we
+ * must observe the state that led to our wakeup.  That is, not only must X
+ * observe its own prior state, it must also observe the @cond store.
+ *
+ * This too is achieved in the above: since CPU1 does the waking, we only need
+ * the ordering between CPU1 and CPU2, which is the same as above.
+ *
+ *
+ * There is however a much more interesting case for this guarantee, where X
+ * never makes it off CPU0:
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *                    WMB (aka smp_mb__before_spinlock)
+ *                    LOCK X->pi_lock
+ *
+ *                    if (X->on_rq)
+ *                      LOCK rq(0)->lock
+ *                      X->state = RUNNING
+ *                      UNLOCK rq(0)->lock
+ *
+ *   LOCK rq(0)->lock // MB against CPU1
+ *   UNLOCK rq(0)->lock
+ *
+ *   if (cond) // _TRUE_
+ *
+ *                    UNLOCK X->pi_lock
+ *
+ * Here our task X never quite leaves CPU0; the wakeup happens before we can
+ * dequeue it and schedule someone else.  In this case we must still observe
+ * cond after our call to schedule() completes.
+ *
+ * This is achieved by the smp_mb__before_spinlock() WMB, which ensures the
+ * store cannot leak inside the LOCK, while the LOCK rq(0)->lock on CPU0
+ * provides full order against the UNLOCK rq(0)->lock from CPU1.  Furthermore,
+ * our load of cond cannot happen before this same LOCK.
+ *
+ * Therefore, again, we're good.
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
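
As an aside (not part of the patch itself): the ->on_cpu handshake the BLOCKING
diagram describes -- finish_lock_switch() releasing ->on_cpu after the
sched-out stores, try_to_wake_up() spinning on ->on_cpu and then issuing its
RMB -- boils down to a release/acquire pairing.  Below is a minimal userspace
sketch of just that pairing; C11 atomics stand in for smp_store_release() and
the control-dep + smp_rmb(), and every name (sketch_task, sched_out_side(),
wakeup_side(), state_left_behind) is invented for illustration.  Build with
-pthread.

	/*
	 * Illustrative only -- NOT kernel code.  Userspace C11 stand-in for
	 * the ->on_cpu handshake in the BLOCKING section above.
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	struct sketch_task {
		atomic_int on_cpu;	/* stands in for p->on_cpu       */
		int state_left_behind;	/* "the state X left behind"     */
	};

	static struct sketch_task X = { .on_cpu = 1 };

	/* CPU0: end of sched-out, cf. the RELEASE / X->on_cpu = 0 steps
	 * (the kernel uses smp_store_release() in finish_lock_switch()). */
	static void *sched_out_side(void *arg)
	{
		(void)arg;
		X.state_left_behind = 42;		/* plain store ...   */
		atomic_store_explicit(&X.on_cpu, 0,	/* ... released here */
				      memory_order_release);
		return NULL;
	}

	/* CPU1: cf. the while (X->on_cpu) cpu_relax(); RMB steps in
	 * try_to_wake_up().  The kernel uses a control dependency plus
	 * smp_rmb(); an acquire load is the portable C11 equivalent. */
	static void *wakeup_side(void *arg)
	{
		(void)arg;
		while (atomic_load_explicit(&X.on_cpu, memory_order_acquire))
			;			/* spin until sched-out completes */
		/* release/acquire pair => this load must observe 42 */
		printf("state left behind: %d\n", X.state_left_behind);
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t1, NULL, wakeup_side, NULL);
		pthread_create(&t0, NULL, sched_out_side, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		return 0;
	}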
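The other ordering both diagrams lean on -- "LOCK provides full order against
the UNLOCK" on the rq locks, which is also what the PREEMPTION/MIGRATION case
and the non-llist path in ttwu_queue() rely on -- can likewise be sketched in
userspace, with a pthread mutex standing in for rq(0)->lock.  Note this
collapses the smp_mb__before_spinlock() subtlety into the mutex's own
release/acquire rules, so it only shows the resulting guarantee (the cond
re-check after the lock handoff must see the waker's store), not the kernel's
exact argument; waker(), sleeper_recheck(), cond_flag and task_running are
invented names.

	/* Illustrative only -- NOT kernel code. */
	#include <assert.h>
	#include <pthread.h>
	#include <stdbool.h>

	static pthread_mutex_t rq0_lock = PTHREAD_MUTEX_INITIALIZER;
	static bool cond_flag;		/* @cond in the comment              */
	static bool task_running;	/* stands in for X->state = RUNNING  */

	/* cf. CPU1: cond = true; ...; LOCK rq(0)->lock; X->state = RUNNING;
	 * UNLOCK rq(0)->lock -- the unlock releases the cond store as well. */
	static void *waker(void *arg)
	{
		(void)arg;
		cond_flag = true;
		pthread_mutex_lock(&rq0_lock);
		task_running = true;
		pthread_mutex_unlock(&rq0_lock);
		return NULL;
	}

	/* cf. CPU0: schedule() takes and drops rq(0)->lock, then wait_event()
	 * re-checks cond.  If the wakeup was observed inside the lock, the
	 * unlock that published it also published cond. */
	static void *sleeper_recheck(void *arg)
	{
		bool was_woken;

		(void)arg;
		pthread_mutex_lock(&rq0_lock);	/* MB against the waker's UNLOCK */
		was_woken = task_running;
		pthread_mutex_unlock(&rq0_lock);

		if (was_woken)
			assert(cond_flag);	/* the documented guarantee */
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t1, NULL, waker, NULL);
		pthread_create(&t0, NULL, sleeper_recheck, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		return 0;
	}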