[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <57d9f3b5-182a-21ed-528c-7a1ec7dad4ca@efficios.com>
Date: Wed, 5 Apr 2023 10:21:52 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>
Subject: Re: [RFC PATCH] sched: Fix performance regression introduced by
mm_cid (v2)
On 2023-04-05 08:57, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 08:15:35AM -0400, Mathieu Desnoyers wrote:
>> +/*
>> + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
>> + * that the rq lock is held.
>> + */
>> +void sched_mm_cid_migrate_from(struct task_struct *t)
>> +{
>> + int src_cid, *src_pcpu_cid, last_mm_cid;
>> + struct mm_struct *mm = t->mm;
>> + struct rq *src_rq;
>> + struct task_struct *src_task;
>> +
>> + if (!mm)
>> + return;
>> +
>> + last_mm_cid = t->last_mm_cid;
>> + /*
>> + * If the migrated task has no last cid, or if the current
>> + * task on src rq uses the cid, it means the destination cpu
>> + * does not have to reallocate its cid to keep the cid allocation
>> + * compact.
>> + */
>> + if (last_mm_cid == -1)
>> + return;
>> +
>> + src_rq = task_rq(t);
>> + src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
>> + src_cid = READ_ONCE(*src_pcpu_cid);
>> +
>> + if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
>> + return;
>> +
>> + /*
>> + * If we observe an active task using the mm on this rq, it means we
>> + * are not the last task to be migrated from this cpu for this mm, so
>> + * there is no need to clear the src_cid.
>> + */
>> + rcu_read_lock();
>> + src_task = rcu_dereference(src_rq->curr);
>
> Continuing our discussion from IRC; your concern was if we need a
> barrier near RCU_INIT_POINTER() in __schedule(). Now, typically such a
> site would use rcu_assign_pointer() and be a store-release, which is
> ommitted in this case.
>
> Specifically as commit 5311a98fef7d argues, there's at least one barrier
> in between most fields being set and this assignment.
>
> On top of that, the below only has the ->mm dependent load, and task->mm
> is fairly constant. The obvious exception being kthread_use_mm().
>
>> + if (src_task->mm_cid_active && src_task->mm == mm) {
>> + rcu_read_unlock();
>> + t->last_mm_cid = -1;
>> + return;
>> + }
>> + rcu_read_unlock();
>
> So if we get here, then rq->curr->mm was observed to not match t->mm.
> However, nothing stops the rq from switching to a task that does match
> right here.
>
>> +
>> + /*
>> + * If the source cpu cid is set, and matches the last cid of the
>> + * migrated task, clear the source cpu cid to keep cid allocation
>> + * compact to cover the case where this task is the last task using
>> + * this mm on the source cpu. If there happens to be other tasks left
>> + * on the source cpu using this mm, the next task using this mm will
>> + * reallocate its cid on context switch.
>> + *
>> + * We cannot keep ownership of concurrency ID without runqueue
>> + * lock held when it is not used by a current task, because it
>> + * would lead to allocation of more concurrency ids than there
>> + * are possible cpus in the system. The last_mm_cid is used as
>> + * a hint to conditionally unset the dst cpu cid, keeping
>> + * allocated concurrency ids compact.
>> + */
>> + if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
>> + return;
>
> So we set LAZY, and because that switch above will not observe this
> flag, we must check again:
>
> And if there has indeed been a switch; that CPU will have gone through
> at least one smp_mb() (there's one at the start of __schedule()), so
> either way, it will see the LAZY or we will see the update or both.
>
>> +
>> + /*
>> + * If we observe an active task using the mm on this rq after setting the lazy-put
>> + * flag, this task will be responsible for transitioning from lazy-put
>> + * flag set to MM_CID_UNSET.
>> + */
>> + rcu_read_lock();
>> + src_task = rcu_dereference(src_rq->curr);
>> + if (src_task->mm_cid_active && src_task->mm == mm) {
>> + rcu_read_unlock();
>> + /*
>> + * We observed an active task for this mm, clearing the destination
>> + * cpu mm_cid is not relevant for compactness.
>> + */
>> + t->last_mm_cid = -1;
>> + return;
>> + }
>> + rcu_read_unlock();
>
> It is still unused, so wipe it.
>
>> +
>> + if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
>> + return;
>> + __mm_cid_put(mm, src_cid);
>> +}
>
> Did I miss any races?
I think your analysis is correct. The full barrier I thought was missing between
store to rq->curr and load of per-mm/cpu cid is indeed in context_switch() in the
next->mm != NULL case:
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
lru_gen_use_mm(next->mm);
if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
And in case you would happen to wonder how wrote that big comment about
the barrier, it happens to be me from 2018. :-)
So that barrier takes care of ordering store to rq->curr before load
of per-mm/cpu cid.
And as you said, the barrier at the beginning of __schedule() takes care
of ordering the load of per-mm/cpu cid of the previous context switch with
respect to the next context switch store to rq->curr.
The only other thing I am concerned about is a concurrent update to
t->mm_cid_active of the src rq current task happening concurrently with
sched_mm_cid_migrate_from() migrating a task from that src rq. It is set
to 0 on the current task by sched_mm_cid_exit_signals() and
sched_mm_cid_before_execve(), and set to 1 by sched_mm_cid_after_execve().
In both cases where it is set to 0, there is also a mm_cid_put() which
releases the cid (same effect as what we aim to do in migrate-from). But
I think we need additional barriers here, e.g. clearing t->mm_cid_active
should be done before calling mm_cid_put(t), so a concurrent migrate-from
does not set a lazy-flag that will never be handled. We should also add
a barrier between mm_cid_get() and setting t->mm_cid_active to 1 in
sched_mm_cid_after_execve.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists