Message-ID: <20230406095122.GF386572@hirez.programming.kicks-ass.net>
Date: Thu, 6 Apr 2023 11:51:22 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>
Subject: Re: [RFC PATCH v3] sched: Fix performance regression introduced by
mm_cid
On Wed, Apr 05, 2023 at 04:37:42PM -0400, Mathieu Desnoyers wrote:
> On 2023-04-05 12:26, Mathieu Desnoyers wrote:
> [...]
> > #ifdef CONFIG_SCHED_MM_CID
> > +/*
> > + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
> > + * that the rq lock is held.
> > + */
> > +void sched_mm_cid_migrate_from(struct task_struct *t)
> > +{
> > +	int src_cid, *src_pcpu_cid, last_mm_cid;
> > +	struct mm_struct *mm = t->mm;
> > +	struct rq *src_rq;
> > +	struct task_struct *src_task;
> > +
> > +	if (!mm)
> > +		return;
> > +
> > +	last_mm_cid = t->last_mm_cid;
> > +	/*
> > +	 * If the migrated task has no last cid, or if the current
> > +	 * task on src rq uses the cid, it means the destination cpu
> > +	 * does not have to reallocate its cid to keep the cid allocation
> > +	 * compact.
> > +	 */
> > +	if (last_mm_cid == -1)
> > +		return;
> > +
> > +	src_rq = task_rq(t);
> > +	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> > +	src_cid = READ_ONCE(*src_pcpu_cid);
> > +
> > +	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
> > +		return;
> > +
> > +	/*
> > +	 * If we observe an active task using the mm on this rq, it means we
> > +	 * are not the last task to be migrated from this cpu for this mm, so
> > +	 * there is no need to clear the src_cid.
> > +	 */
> > +	rcu_read_lock();
> > +	src_task = rcu_dereference(src_rq->curr);
> > +	if (src_task->mm_cid_active && src_task->mm == mm) {
> > +		rcu_read_unlock();
> > +		t->last_mm_cid = -1;
> > +		return;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	/*
> > +	 * If the source cpu cid is set, and matches the last cid of the
> > +	 * migrated task, clear the source cpu cid to keep cid allocation
> > +	 * compact to cover the case where this task is the last task using
> > +	 * this mm on the source cpu. If there happens to be other tasks left
> > +	 * on the source cpu using this mm, the next task using this mm will
> > +	 * reallocate its cid on context switch.
> > +	 *
> > +	 * We cannot keep ownership of concurrency ID without runqueue
> > +	 * lock held when it is not used by a current task, because it
> > +	 * would lead to allocation of more concurrency ids than there
> > +	 * are possible cpus in the system. The last_mm_cid is used as
> > +	 * a hint to conditionally unset the dst cpu cid, keeping
> > +	 * allocated concurrency ids compact.
> > +	 */
> > +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
> > +		return;
> > +
> > +	/*
> > +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
> > +	 * rq->curr->mm matches the scheduler barrier in context_switch()
> > +	 * between store to rq->curr and load of prev and next task's
> > +	 * per-mm/cpu cid.
>
> Thinking about it further, I suspect the transition:
>
> * user -> kernel lazy + mmgrab() active
>
> in context_switch() lacks the memory barrier we need here (mmgrab() is
> only an atomic_inc()).
>
> The scenario is a transition from user to kernel thread happening
> concurrently with migrate-from.
>
> Because there is no memory barrier between setting rq->curr to a value
> that has a NULL mm and loading the per-mm/cpu cid value in
> mm_cid_put_lazy() for the prev task, nothing guarantees that the
> src_rq->curr rcu dereference which follows here will observe that store.
>
> Or am I missing something?
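
( mmgrab() is indeed just this, from include/linux/sched/mm.h:

	static inline void mmgrab(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_count);
	}

  so no ordering is implied there. )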
OK, so last night I thought it was me needing sleep (which might still
be the case), but now I'm still not quite getting it. Let me draw a
picture, perhaps that'll help...
It reads to me like you're treating a barrier as 'makes visible', which
it never will.
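A barrier never forces a store to become visible at any particular
moment; it only constrains the order in which accesses can be observed,
and only in combination with a pairing barrier on the other side. The
classic message-passing pattern shows this (x and y are hypothetical
shared variables, both initially zero):

	/* writer */			/* reader */
	WRITE_ONCE(x, 1);		r1 = READ_ONCE(y);
	smp_wmb();			smp_rmb();
	WRITE_ONCE(y, 1);		r2 = READ_ONCE(x);

The barriers guarantee only that r1 == 1 implies r2 == 1; r1 == 0
remains a legal outcome no matter how many barriers you add.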
CPU0 will run a user task A and switch to a kernel thread K;
CPU1 will concurrently run sched_mm_cid_migrate_from().
	CPU0					CPU1

	.. runs A ..

						if (src_task->mm_cid_active && src_task->mm == mm)
							// false, proceed

						cmpxchg(); // set LAZY

	__schedule()
(A)
	  rq->curr = K;
(B)
	  context_switch()
	    prepare_task_switch()
	      switch_mm_cid()
	        cid_put_lazy(prev)
	        cid_get(next)
	    mmgrab(); // user->kernel active_mm
	    switch_to();

	.. runs K ..

						src_task = rcu_dereference(src_rq->curr);
						if (src_task->mm_cid_active && src_task->mm == mm)
And it is this second compare that can either observe A or K, right?
That is, at location A or B above. Later makes no sense, because by then
switch_mm_cid() will have taken care of things.
If A, still false, and we continue to mark UNSET and put.
If B, we see K, which will not have mm_cid_active set, so it is still
false, same as above.
What am I missing?