Date:   Wed, 5 Apr 2023 14:57:55 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>
Subject: Re: [RFC PATCH] sched: Fix performance regression introduced by
 mm_cid (v2)

On Wed, Apr 05, 2023 at 08:15:35AM -0400, Mathieu Desnoyers wrote:
> +/*
> + * Migration from src cpu. Called from set_task_cpu(). There are no guarantees
> + * that the rq lock is held.
> + */
> +void sched_mm_cid_migrate_from(struct task_struct *t)
> +{
> +	int src_cid, *src_pcpu_cid, last_mm_cid;
> +	struct mm_struct *mm = t->mm;
> +	struct rq *src_rq;
> +	struct task_struct *src_task;
> +
> +	if (!mm)
> +		return;
> +
> +	last_mm_cid = t->last_mm_cid;
> +	/*
> +	 * If the migrated task has no last cid, or if the current
> +	 * task on src rq uses the cid, it means the destination cpu
> +	 * does not have to reallocate its cid to keep the cid allocation
> +	 * compact.
> +	 */
> +	if (last_mm_cid == -1)
> +		return;
> +
> +	src_rq = task_rq(t);
> +	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(src_rq));
> +	src_cid = READ_ONCE(*src_pcpu_cid);
> +
> +	if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
> +		return;
> +
> +	/*
> +	 * If we observe an active task using the mm on this rq, it means we
> +	 * are not the last task to be migrated from this cpu for this mm, so
> +	 * there is no need to clear the src_cid.
> +	 */
> +	rcu_read_lock();
> +	src_task = rcu_dereference(src_rq->curr);

Continuing our discussion from IRC: your concern was whether we need a
barrier near RCU_INIT_POINTER() in __schedule(). Typically such a site
would use rcu_assign_pointer() and thus be a store-release; that
ordering is omitted in this case.

Specifically, as commit 5311a98fef7d argues, there's at least one barrier
between most fields being set and this assignment.
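
Schematically (hand-waving, not the actual __schedule() code), the
difference at the publish site is:

	/* store-release: orders all prior stores to @next before the
	 * pointer becomes visible to rcu_dereference() readers */
	rcu_assign_pointer(rq->curr, next);

vs:

	/* plain store: no ordering on its own; relies on barriers
	 * already present, e.g. smp_mb__after_spinlock() at the start
	 * of __schedule(), per 5311a98fef7d */
	RCU_INIT_POINTER(rq->curr, next);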

On top of that, the code below only has the ->mm dependent load, and
task->mm is fairly constant, the obvious exception being kthread_use_mm().
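
(kthread_use_mm() being, very roughly and with all the locking and
error handling elided:

	/* sketch only; see kernel/kthread.c for the real thing */
	task_lock(tsk);
	/* ... */
	tsk->mm = mm;
	/* ... */
	task_unlock(tsk);

so for kthreads ->mm can in fact change at runtime.)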

> +	if (src_task->mm_cid_active && src_task->mm == mm) {
> +		rcu_read_unlock();
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +	rcu_read_unlock();

So if we get here, then rq->curr->mm was observed to not match t->mm.
However, nothing stops the rq from switching to a task that does match
right here.
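
That is, something like:

	CPU0 (sched_mm_cid_migrate_from)        CPU1 (src_rq)

	src_task = rcu_dereference(src_rq->curr);
	/* observes e.g. a kworker, ->mm != mm */
	                                        __schedule();
	                                        /* rq->curr = t2,
	                                         * t2->mm == mm; t2 can
	                                         * start using src_cid */

so acting on that one observation alone would be racy.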

> +
> +	/*
> +	 * If the source cpu cid is set, and matches the last cid of the
> +	 * migrated task, clear the source cpu cid to keep cid allocation
> +	 * compact to cover the case where this task is the last task using
> +	 * this mm on the source cpu. If there happens to be other tasks left
> +	 * on the source cpu using this mm, the next task using this mm will
> +	 * reallocate its cid on context switch.
> +	 *
> +	 * We cannot keep ownership of concurrency ID without runqueue
> +	 * lock held when it is not used by a current task, because it
> +	 * would lead to allocation of more concurrency ids than there
> +	 * are possible cpus in the system. The last_mm_cid is used as
> +	 * a hint to conditionally unset the dst cpu cid, keeping
> +	 * allocated concurrency ids compact.
> +	 */
> +	if (cmpxchg(src_pcpu_cid, src_cid, mm_cid_set_lazy_put(src_cid)) != src_cid)
> +		return;

So we set LAZY, and because such a racing switch will not have observed
this flag, we must check again:

And if there has indeed been a switch, that CPU will have gone through
at least one smp_mb() (there's one at the start of __schedule()), so
either way: it will see the LAZY flag, or we will see the updated
rq->curr, or both.
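
That is, roughly:

	CPU0 (migrate_from)                     CPU1 (src_rq, __schedule())

	cmpxchg(src_pcpu_cid, ...);             smp_mb__after_spinlock();
	/* implies full barrier */              rq->curr = next;
	src_task = rcu_dereference(src_rq->curr);
	                                        /* next's cid code reads
	                                         * *src_pcpu_cid, sees LAZY */

With a full barrier on both sides, at least one of the two observations
must happen: we see the new curr, or it sees LAZY, or both.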

> +
> +	/*
> +	 * If we observe an active task using the mm on this rq after
> +	 * setting the lazy-put flag, this task will be responsible for
> +	 * transitioning from the lazy-put flag being set to MM_CID_UNSET.
> +	 */
> +	rcu_read_lock();
> +	src_task = rcu_dereference(src_rq->curr);
> +	if (src_task->mm_cid_active && src_task->mm == mm) {
> +		rcu_read_unlock();
> +		/*
> +		 * We observed an active task for this mm; clearing the
> +		 * destination cpu mm_cid is not relevant for compactness.
> +		 */
> +		t->last_mm_cid = -1;
> +		return;
> +	}
> +	rcu_read_unlock();

It is still unused, so wipe it.

> +
> +	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) != mm_cid_set_lazy_put(src_cid))
> +		return;
> +	__mm_cid_put(mm, src_cid);
> +}
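
(For completeness, I'm assuming the lazy-put state is a flag bit or'ed
into the cid word, something like:

	/* assumed encoding; names as used in the patch */
	#define MM_CID_LAZY_PUT		(1U << 31)

	static inline int mm_cid_set_lazy_put(int cid)
	{
		return cid | MM_CID_LAZY_PUT;
	}

	static inline bool mm_cid_is_valid(int cid)
	{
		return !(cid & MM_CID_LAZY_PUT) && cid != MM_CID_UNSET;
	}

so the two cmpxchg()s transition cid -> cid|LAZY -> MM_CID_UNSET, and
only the winner of the second one gets to do __mm_cid_put().)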

Did I miss any races?
