Message-ID: <3b4684ea-5c0d-376b-19cf-195684ec4e0e@efficios.com>
Date: Fri, 7 Apr 2023 21:14:36 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>,
Olivier Dion <odion@...icios.com>, michael.christie@...cle.com
Subject: Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm_cid
On 2023-04-07 19:50, Mathieu Desnoyers wrote:
[...]
>
> Let's introduce task (Y) which has mm == src_task->mm and task (N) which
> has mm != src_task->mm for the rest of the discussion.
>
> Let's consider the scheduling state transitions of interest here. There are
> two scheduler state transitions on context switch that we care about:
>
> (TSA) Store to rq->curr with transition from (N) to (Y)
>
> (TSB) Store to rq->curr with transition from (Y) to (N)
>
> On the migrate-from side, there is one transition we care about:
>
> (TMA) cmpxchg to *pcpu_cid to set the LAZY flag
>
> There is also a transition to UNSET state which can be performed from all
> sides (scheduler, migrate-from). It is performed with a cmpxchg everywhere
> which guarantees that only a single thread will succeed:
>
> (TMB) cmpxchg to *pcpu_cid to mark UNSET
>
> Just to be clear (at the risk of repeating myself), what we do _not_ want
> to happen is a transition to UNSET when a thread is actively using the cid
> (property (1)). And ideally we do not want to leak a cid after migrating the
> last task from a cpu (property (2)).
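
To make the (TMA)/(TMB) transitions concrete, here is a minimal, self-contained
C11 sketch. The flag layout and helper names (CID_UNSET, CID_LAZY_FLAG,
cid_set_lazy(), cid_try_unset()) are made up for illustration and are not
claimed to match the kernel implementation; the point is only that the cmpxchg
to UNSET can have at most one winner.

#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative encoding: a per-cpu cid slot holds a valid cid, a valid
 * cid with the LAZY flag set (after TMA), or UNSET (after TMB). */
#define CID_UNSET	(-1)
#define CID_LAZY_FLAG	(1 << 30)

/* (TMA): migrate-from marks the slot lazy; the cmpxchg implies a full barrier. */
static bool cid_set_lazy(atomic_int *pcpu_cid)
{
	int old = atomic_load(pcpu_cid);

	if (old == CID_UNSET || (old & CID_LAZY_FLAG))
		return false;		/* nothing to mark */
	return atomic_compare_exchange_strong(pcpu_cid, &old,
					      old | CID_LAZY_FLAG);
}

/* (TMB): move a lazy slot to UNSET. The scheduler and migrate-from may
 * race here, but the cmpxchg guarantees exactly one of them succeeds. */
static bool cid_try_unset(atomic_int *pcpu_cid)
{
	int old = atomic_load(pcpu_cid);

	if (old == CID_UNSET || !(old & CID_LAZY_FLAG))
		return false;		/* not lazy: still actively owned */
	return atomic_compare_exchange_strong(pcpu_cid, &old, CID_UNSET);
}

int main(void)
{
	atomic_int pcpu_cid = 3;	/* some cid currently in use on this cpu */

	cid_set_lazy(&pcpu_cid);	/* (TMA) */
	cid_try_unset(&pcpu_cid);	/* (TMB): the first contender wins */
	return atomic_load(&pcpu_cid) == CID_UNSET ? 0 : 1;
}
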
>
> Let's look at the relevant combinations of TSA/TSB and TMA transitions.
>
> Scenario A) (TSA)+(TMA) (from next task perspective)
>
> CPU0                                        CPU1
>
> Context switch CS-1                         Migrate-from
>   - store to rq->curr: (N)->(Y) (TSA)       - cmpxchg to *pcpu_cid to LAZY (TMA)
>     *** missing barrier ?? ***                (implied barrier after cmpxchg)
>   - prepare_task_switch()
>     - switch_mm_cid()
>       - mm_cid_get (next)
>         - READ_ONCE(*pcpu_cid)              - rcu_dereference(src_rq->curr)
>
> This Dekker ensures that either task (Y) is observed by the rcu_dereference()
> or the LAZY flag is observed by READ_ONCE(), or both are observed.
>
> If task (Y) store is observed by rcu_dereference(), it means that there is
> still an active task on the cpu. Migrate-from will therefore not transition
> to UNSET, which fulfills property (1). That observed task will itself
> eventually need a migrate-from to be migrated away from that cpu, which
> fulfills property (2).
>
> If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(),
> it will move its state to UNSET, which clears the percpu cid perhaps
> uselessly (which is not an issue for correctness). Because task (Y) is not
> observed, CPU1 can move ahead to set the state to UNSET. Because moving
> state to UNSET is done with a cmpxchg expecting that the old state has the
> LAZY flag set, only one thread will successfully UNSET.
>
> If both states (LAZY flag and task (Y)) are observed, the thread on CPU0
> will observe the LAZY flag and transition to UNSET (perhaps uselessly), and
> CPU1 will observe task (Y) and do nothing more, which is fine.
>
> What we are effectively preventing with this Dekker is a scenario where
> neither the LAZY flag nor the store (Y) is observed, which would fail
> property (1) because this would UNSET a cid which is actively used.
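
The store-buffering (Dekker) argument above can be modelled with a small
self-contained C11 program. Plain userspace atomics and fences stand in for
the kernel primitives here, so this is only a model of the required ordering,
not the scheduler code:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static atomic_int rq_curr_is_y;		/* stands in for the rq->curr (N)->(Y) store */
static atomic_int pcpu_cid_lazy;	/* stands in for the cmpxchg setting LAZY */
static int r_cpu0, r_cpu1;

static int cpu0_context_switch(void *arg)
{
	(void)arg;
	/* (TSA) store to rq->curr */
	atomic_store_explicit(&rq_curr_is_y, 1, memory_order_relaxed);
	/* the barrier under discussion (provided by the patch below) */
	atomic_thread_fence(memory_order_seq_cst);
	/* mm_cid_get(): READ_ONCE(*pcpu_cid) */
	r_cpu0 = atomic_load_explicit(&pcpu_cid_lazy, memory_order_relaxed);
	return 0;
}

static int cpu1_migrate_from(void *arg)
{
	(void)arg;
	/* (TMA) set LAZY; the cmpxchg implies a full barrier */
	atomic_store_explicit(&pcpu_cid_lazy, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* rcu_dereference(src_rq->curr) */
	r_cpu1 = atomic_load_explicit(&rq_curr_is_y, memory_order_relaxed);
	return 0;
}

int main(void)
{
	thrd_t t0, t1;

	thrd_create(&t0, cpu0_context_switch, NULL);
	thrd_create(&t1, cpu1_migrate_from, NULL);
	thrd_join(t0, NULL);
	thrd_join(t1, NULL);
	/* With both fences in place, r_cpu0 == 0 && r_cpu1 == 0 is forbidden. */
	printf("cpu0 saw LAZY=%d, cpu1 saw (Y)=%d\n", r_cpu0, r_cpu1);
	return 0;
}
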
>
>
> Scenario B) (TSB)+(TMA) (from prev task perspective)
>
> CPU0                                        CPU1
>
> Context switch CS-1                         Migrate-from
>   - store to rq->curr: (Y)->(N) (TSB)       - cmpxchg to *pcpu_cid to LAZY (TMA)
>     *** missing barrier ?? ***                (implied barrier after cmpxchg)
>   - prepare_task_switch()
>     - switch_mm_cid()
>       - cid_put_lazy() (prev)
>         - READ_ONCE(*pcpu_cid)              - rcu_dereference(src_rq->curr)
>
> This Dekker ensures that either task (N) is observed by the rcu_dereference()
> or the LAZY flag is observed by READ_ONCE(), or both are observed.
>
> If rcu_dereference observes (N) but LAZY is not observed, migrate-from will
> take care to advance the state to UNSET, thus fulfilling property (2).
> Because (Y) is not running anymore, property (1) is fulfilled.
>
> If rcu_dereference does not observe (N), but LAZY is observed, migrate-from
> does not advance to UNSET because it observes (Y), but the LAZY flag will
> make the task on CPU0 take care of advancing the state to UNSET, thus
> fulfilling property (2).
>
> If both (N) and LAZY are observed, both migrate-from and CPU0 will try to
> advance the state to UNSET, but only one will succeed its cmpxchg.
>
> What we are effectively preventing with this Dekker is a scenario where
> neither the LAZY flag nor the store (N) is observed, which would fail
> property (2) because it would leak a cid on a cpu that has no task left
> using the mm.
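
Putting the two scenarios together, the migrate-from side can be sketched as
follows. The names (pcpu_cid, rq_curr_mm, CID_*) are hypothetical stand-ins
for the kernel's data structures, and the logic is kept flat so the sketch
compiles on its own:

#include <stdatomic.h>

#define NR_CPUS		4
#define CID_UNSET	(-1)
#define CID_LAZY_FLAG	(1 << 30)

static atomic_int pcpu_cid[NR_CPUS];		   /* per-cpu cid slots of one mm */
static _Atomic(const void *) rq_curr_mm[NR_CPUS];  /* stands in for rq->curr->mm */

static void migrate_from(int src_cpu, const void *mm)
{
	atomic_int *slot = &pcpu_cid[src_cpu];
	int old = atomic_load(slot);

	if (old == CID_UNSET || (old & CID_LAZY_FLAG))
		return;				/* nothing to reclaim here */

	/* (TMA): mark LAZY; the seq_cst cmpxchg doubles as the full barrier. */
	if (!atomic_compare_exchange_strong(slot, &old, old | CID_LAZY_FLAG))
		return;				/* lost a race, give up */

	/*
	 * Dekker load: if no task of @mm is observed running on src_cpu,
	 * then by the argument above any such task's context switch must
	 * observe the LAZY flag and will finish the transition itself.
	 */
	if (atomic_load(&rq_curr_mm[src_cpu]) == mm)
		return;				/* property (1): leave the cid alone */

	/* (TMB): at most one contender wins the transition to UNSET. */
	old |= CID_LAZY_FLAG;
	atomic_compare_exchange_strong(slot, &old, CID_UNSET);
}

int main(void)
{
	static const char mm_a, mm_b;

	atomic_store(&pcpu_cid[1], 7);		/* a task of mm_a used cid 7 on cpu 1 */
	atomic_store(&rq_curr_mm[1], &mm_b);	/* cpu 1 now runs a task of another mm */
	migrate_from(1, &mm_a);			/* reclaims: the slot moves to UNSET */
	return atomic_load(&pcpu_cid[1]) == CID_UNSET ? 0 : 1;
}
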
>
> So based on my analysis, we are indeed missing a barrier between the store
> to rq->curr and the load of the per-mm/cpu cid within context_switch().
>
> Am I missing something? How can we provide this barrier with minimal
> overhead?
Here is the barrier placement I have in mind to minimize the impact:
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..f20fc0600fcc 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
 	atomic_inc(&mm->mm_count);
 }
 
+static inline void smp_mb__after_mmgrab(void)
+{
+	smp_mb__after_atomic();
+}
+
 extern void __mmdrop(struct mm_struct *mm);
 
 static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e0fa4193499..8d410c0dcb39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5117,7 +5117,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
-	switch_mm_cid(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5273,6 +5272,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 *
 	 * kernel ->   user   switch + mmdrop() active
 	 *   user ->   user   switch
+	 *
+	 * switch_mm_cid() needs to be updated if the barriers provided
+	 * by context_switch() are modified.
 	 */
 	if (!next->mm) {                                // to kernel
 		enter_lazy_tlb(prev->active_mm, next);
@@ -5302,6 +5304,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		}
 	}
 
+	/* switch_mm_cid() requires the memory barriers above. */
+	switch_mm_cid(prev, next);
+
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 
 	prepare_lock_switch(rq, next, rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
 
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
 {
+	/*
+	 * Provide a memory barrier between rq->curr store and load of
+	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
+	 *
+	 * Should be adapted if context_switch() is modified.
+	 */
+	if (!next->mm) {				// to kernel
+		/*
+		 * user -> kernel transition does not guarantee a barrier, but
+		 * we can use the fact that it performs an atomic operation in
+		 * mmgrab().
+		 */
+		if (prev->mm)				// from user
+			smp_mb__after_mmgrab();
+		/*
+		 * kernel -> kernel transition does not change rq->curr->mm
+		 * state. It stays NULL.
+		 */
+	} else {					// to user
+		/*
+		 * kernel -> user transition does not provide a barrier
+		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
+		 * Provide it here.
+		 */
+		if (!prev->mm)				// from kernel
+			smp_mb();
+		/*
+		 * user -> user transition guarantees a memory barrier through
+		 * switch_mm().
+		 */
+	}
 	if (prev->mm_cid_active) {
 		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com