linux-kernel - Re: [RFC PATCH v3] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fdaa7242-4ddd-fbe2-bc0e-6c62054dbde8@efficios.com>
Date:   Fri, 7 Apr 2023 19:50:42 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>,
        Olivier Dion <odion@...icios.com>
Subject: Re: [RFC PATCH v3] sched: Fix performance regression introduced by
 mm_cid

On 2023-04-06 05:51, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 04:37:42PM -0400, Mathieu Desnoyers wrote:
>> On 2023-04-05 12:26, Mathieu Desnoyers wrote:
>> [...]
>>>    #ifdef CONFIG_SCHED_MM_CID
>>> +/*
>>> + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
>>> + * that the rq lock is held.
>>> + */
>>> +void sched_mm_cid_migrate_from(struct task_struct *t)
>>> +{
>>> +	int src_cid, *src_pcpu_cid, last_mm_cid;
>>> +	struct mm_struct *mm = t->mm;
>>> +	struct rq *src_rq;
>>> +	struct task_struct *src_task;
>>> +
>>> +	if (!mm)
>>> +		return;
>>> +
>>> +	last_mm_cid = t->last_mm_cid;
>>> +	/*
>>> +	 * If the migrated task has no last cid, or if the current
>>> +	 * task on src rq uses the cid, it means the destination cpu
>>> +	 * does not have to reallocate its cid to keep the cid allocation
>>> +	 * compact.
>>> +	 */
>>> +	if (last_mm_cid == -1)
>>> +		return;
>>> +
>>> +	src_rq = task_rq(t);
>>> +	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
>>> +	src_cid = READ_ONCE(*src_pcpu_cid);
>>> +
>>> +	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * If we observe an active task using the mm on this rq, it means we
>>> +	 * are not the last task to be migrated from this cpu for this mm, so
>>> +	 * there is no need to clear the src_cid.
>>> +	 */
>>> +	rcu_read_lock();
>>> +	src_task = rcu_dereference(src_rq->curr);
>>> +	if (src_task->mm_cid_active && src_task->mm == mm) {
>>> +		rcu_read_unlock();
>>> +		t->last_mm_cid = -1;
>>> +		return;
>>> +	}
>>> +	rcu_read_unlock();
>>> +
>>> +	/*
>>> +	 * If the source cpu cid is set, and matches the last cid of the
>>> +	 * migrated task, clear the source cpu cid to keep cid allocation
>>> +	 * compact to cover the case where this task is the last task using
>>> +	 * this mm on the source cpu. If there happens to be other tasks left
>>> +	 * on the source cpu using this mm, the next task using this mm will
>>> +	 * reallocate its cid on context switch.
>>> +	 *
>>> +	 * We cannot keep ownership of concurrency ID without runqueue
>>> +	 * lock held when it is not used by a current task, because it
>>> +	 * would lead to allocation of more concurrency ids than there
>>> +	 * are possible cpus in the system. The last_mm_cid is used as
>>> +	 * a hint to conditionally unset the dst cpu cid, keeping
>>> +	 * allocated concurrency ids compact.
>>> +	 */
>>> +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * The implicit barrier after cmpxchg per-mm/cpu cid before loading
>>> +	 * rq->curr->mm matches the scheduler barrier in context_switch()
>>> +	 * between store to rq->curr and load of prev and next task's
>>> +	 * per-mm/cpu cid.
>>
>> Thinking about it further, I suspect the transition:
>>
>>           *   user -> kernel   lazy + mmgrab() active
>>
>> in context_switch() lacks a memory barrier we need here (it's only an
>> atomic_inc with mmgrab()).
>>
>> The scenario is a transition from user to kernel thread happening
>> concurrently with migrate-from.
>>
>> Because there is no memory barrier between set rq->curr to a value that
>> has a NULL mm and loading per-mm/cpu cid value in mm_cid_put_lazy() for the
>> prev task, nothing guarantees that the following src_rq->curr rcu
>> dereference here will observe the store.
>>
>> Or am I missing something ?
> 
> OK, so last night I thought it was me needing sleep (which might still
> be the case), but now I'm still not quite getting it. Let me draw a
> picture, perhaps that'll help...

Let me try to start from your explanation to expose the case I have in mind.

> 
> It reads to me like you're treating a barrier as 'makes visible', which
> it never will.

No. What I am trying to figure out here is how we can introduce a Dekker
memory ordering (ref. row 10 from table at https://lwn.net/Articles/573436/)
to guarantee that the loads observe at least one of the stores:

Taken from kernel/futex/waitwake.c:

  *      X = Y = 0
  *
  *      w[X]=1          w[Y]=1
  *      MB              MB
  *      r[Y]=y          r[X]=x
  *
  * Which guarantees that x==0 && y==0 is impossible [...]

But let's first go back to the basic properties we want to guarantee before we
discuss how to actually guarantee them without holding the runqueue lock, and
ideally without additional atomic operations or barriers on the scheduler
fast-path.

There are two basic properties we want is to guarantee here:

(1) Migrate-from should _never_ mark a per-cpu cid UNSET when it is actively used
     by a task. That would lead to concurrent allocation of the cid and userspace
     corruption.

(2) Migrate-from should ideally not leak a cid value when there are no more tasks
     using it on the cpu. This ensures we keep cid allocations as compact as possible
     (closest to 0). Getting this wrong will not cause userspace corruption, just a
     less compact id allocation.

> 
> 
> CPU0 will run a user task A and switch to a kernel thread K;
> CPU1 will concurrently run sched_mm_cid_migrate_from().
> 
> 
> 	CPU0				CPU1
> 
> 	.. runs A ..
> 
> 					if (src_task->mm_cid_active && src_task->mm == mm)
> 					  // false, proceed
> 
> 					cmpxchg(); // set LAZY
> 
> 	__schedule()
> 					(A)
> 	  rq->curr = K;
> 					(B)
> 	  context_switch()
> 	    prepare_task_switch()

Please excuse my earlier emails, I did get the order between store to rq->curr,
the barriers in context_switch(), and prepare_task_switch() wrong. The documented
barriers in context_switch() are _after_ prepare_task_switch(), not before. So a
part of my earlier explanations do not make sense because of this.

We are in a situation where there are no barriers between store to rq->curr and the
call to prepare_task_switch() that follows just after.

> 	      switch_mm_cid()
> 	        cid_put_lazy(prev)
> 		cid_get(next)
>              mmgrab(); // user->kernel active_mm
> 	    switch_to();
> 
> 
> 
> 	.. runs K ..
> 
> 
> And it is this second compare that can either observe A or K, right?
> That is the locations A or B above. Later doesn't make sense because
> then switch_mm_cid() will have taken care of things.
> 
> If A, still false, and we continue to mark UNSET and put.
> If B, we see K, which will not have mm_cid_active set so still false and
> same as above.
> 
> What am I missing?

Let's introduce task (Y) which has mm == src_task->mm and task (N) which has
mm != src_task->mm for the rest of the discussion.

Let's consider the scheduling state transitions we want to consider here.
There are two scheduler state transitions on context switch we care about:

(TSA) Store to rq->curr with transition from (N) to (Y)

(TSB) Store to rq->curr with transition from (Y) to (N)

On the migrate-from side, there is one transition we care about:

(TMA) cmpxchg to *pcpu_cid to set the LAZY flag

There is also a transition to UNSET state which can be performed from all
sides (scheduler, migrate-from). It is performed with a cmpxchg everywhere
which guarantees that only a single thread will succeed:

(TMB) cmpxchg to *pcpu_cid to mark UNSET

Just to be clear (at the risk of repeating myself), what we do _not_ want
to happen is a transition to UNSET when a thread is actively using the cid
(property (1)). And ideally we do not want to leak a cid after migrating the
last task from a cpu (property (2)).

Let's looks at the relevant combinations of TSA/TSB, and TMA transitions.

Scenario A) (TSA)+(TMA) (from next task perspective)

CPU0                                                 CPU1

Context switch CS-1                                  Migrate-from
   - store to rq->curr: (N)->(Y) (TSA)                - cmpxchg to *pcpu_id to LAZY (TMA)
      *** missing barrier ?? ***                        (implied barrier after cmpxchg)
   - prepare_task_switch()
     - switch_mm_cid()
       - mm_cid_get (next)
         - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If task (Y) store is observed by rcu_dereference(), it means that there is still
an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
fulfills property (1). That observed task will itself eventually need a migrate-from
to be migrated away from that cpu, which fulfills property (2).

If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(), it will
move its state to UNSET, which clears the percpu cid perhaps uselessly (which is not
an issue for correctness). Because task (Y) is not observed, CPU1 can move ahead to
set the state to UNSET. Because moving state to UNSET is done with a cmpxchg expecting
that the old state has the LAZY flag set, only one thread will successfully UNSET.

If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
(Y) and do nothing more, which is fine.

What we are effectively preventing with this Dekker is a scenario where neither LAZY
flag nor store (Y) are observed, which would fail property (1) because this would
UNSET a cid which is actively used.


Scenario B) (TSB)+(TMA) (from prev task perspective)

CPU0                                                 CPU1

Context switch CS-1                                  Migrate-from
   - store to rq->curr: (Y)->(N) (TSB)                - cmpxchg to *pcpu_id to LAZY (TMA)
     *** missing barrier ?? ***                         (implied barrier after cmpxchg)
   - prepare_task_switch()
     - switch_mm_cid()
       - cid_put_lazy() (prev)
         - READ_ONCE(*pcpu_cid)                       - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (N) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.

If rcu_dereference observes (N) but LAZY is not observed, migrate-from will take care to
advance the state to UNSET, thus fulfilling property (2). Because (Y) is not running anymore
property (1) is fulfilled.

If rcu_dereference does not observe (N), but LAZY is observed, migrate-from does not
advance to UNSET because it observes (Y), but LAZY flag will make the task on CPU0
take care of advancing the state to UNSET, thus fulfilling property (2).

If both (N) and LAZY are observed, both migrate-from and CPU0 will try to advance the
state to UNSET, but only one will succeed its cmpxchg.

What we are effectively preventing with this Dekker is a scenario where neither LAZY
flag nor store (N) are observed, which would fail property (2) because it would leak
a cid on a cpu that has no task left using the mm.

So based on my analysis, we are indeed missing a barrier between store to rq->curr and
load of the per-mm/cpu cid within context_switch().

Am I missing something ? How can we provide this barrier with minimal overhead ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com