Message-ID: <20250915083815.GB3289052@noisy.programming.kicks-ass.net>
Date: Mon, 15 Sep 2025 10:38:15 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tejun Heo <tj@...nel.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
changwoo@...lia.com, cgroups@...r.kernel.org,
sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to
__task_rq_lock()
On Fri, Sep 12, 2025 at 07:56:21AM -1000, Tejun Heo wrote:
> It *seems* that way to me. There are two other scenarios tho.
>
> - A task can move from a non-local DSQ to another non-local DSQ at any time
> while queued. As this doesn't cause rq migration, we can probably just
> overwrite p->srq_lock to the new one. Need to think about it a bit more.
It can use task_on_rq_migrating(), exactly like 'normal' rq-to-rq
migration:
  LOCK src_dsq->lock
    p->on_rq = TASK_ON_RQ_MIGRATING;
    task_unlink_from_dsq();
  UNLOCK src_dsq->lock

  LOCK dst_dsq->lock
    dispatch_enqueue();
    p->on_rq = TASK_ON_RQ_QUEUED;
  UNLOCK dst_dsq->lock
Same reasoning as for the pick_task_scx() migration: if it observes
!p->srq_lock, then it must also observe MIGRATING, and we spin-wait
until QUEUED, at which point we see the new srq_lock.
> - A task can be queued on a BPF data structure and thus may not be on any
> DSQ. I think this can be handled by adding a raw_spinlock to task_struct
> and treating the task as if it's on its own DSQ by pointing to that one,
> and grabbing that lock when transferring that task from BPF side.
Hmm, and BPF data structures cannot have a lock associated with them?
I'm thinking they must; something has to be serializing all that.
> So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
> doing something simpler and just add a per-task lock which nests inside rq
> lock which is always grabbed by [__]task_rq_lock() and optionally grabbed by
> sched classes that want to migrate tasks without grabbing the source rq
> lock? That way, we don't need to do the lock pointer dancing while achieving
> about the same result. From sched_ext's POV, grabbing that per-task lock is
> likely going to be cheaper than doing the rq lock switching, so it's way
> simpler and nothing gets worse.
I *really* don't like that. Fundamentally, a runqueue is a 'rich' data
structure. It has a container (list, tree, whatever) but also a pile of
statistics (time, vtime, counts, load-sums, averages). Adding/removing a
task from a runqueue needs all of that serialized. A per-task lock simply
cannot do this.
If you've hidden this lock inside BPF such that C cannot access it, then
your abstraction needs fixing. Surely it is possible to have a C DSQ to
mirror whatever the BPF thing does. Add a few helpers for BPF to
create/destroy DSQs (IDs) and a callback to map a task to a DSQ. Then
the C part can use the DSQ lock, and hold it while calling into whatever
the BPF side does.
Additionally, it can sanity-check the BPF thing: tasks cannot go
'missing' without C knowing where they went -- which is that bypass
problem, no?