Message-ID: <aMRexZ_SIUVgkIpZ@slm.duckdns.org>
Date: Fri, 12 Sep 2025 07:56:21 -1000
From: Tejun Heo <tj@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, longman@...hat.com, hannes@...xchg.org,
mkoutny@...e.com, void@...ifault.com, arighi@...dia.com,
changwoo@...lia.com, cgroups@...r.kernel.org,
sched-ext@...ts.linux.dev, liuwenfang@...or.com, tglx@...utronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to
__task_rq_lock()
Hello,

On Fri, Sep 12, 2025 at 01:54:59PM +0200, Peter Zijlstra wrote:
> > With the !slock condition, the following scenario is possible:
> >
> > __task_rq_lock()
> >   slock = p->srq_lock; /* NULL */
> >                                      dispatch_enqueue()
> >                                        p->srq_lock = &dsq->lock;
> >                                        enqueue finishes
> >   raw_spin_rq_lock(rq);
> >   rq is the same, $slock is NULL, return
> >   do something assuming p is locked down
> >                                      p gets dispatched to another rq
> >
> > I'm unclear on when p->srq_lock would be safe to set and clear, so the goal
> > is that whoever does [__]task_rq_lock() ends up waiting on the dsq lock that
> > the task is queued on, and if we can exclude other sched operations that
> > way, we don't have to hold source rq lock when moving the task to another rq
> > for execution, right?
>
> Indeed. If !p->srq_lock then task_rq(p)->lock must be sufficient.
>
> So for enqueue, which sets p->srq_lock, this must be done while holding
> task_rq(p)->lock.
>
> So the above example should be serialized on task_rq(p)->lock, since
> __task_rq_lock() holds it, enqueue cannot happen. Conversely, if enqueue
> holds task_rq(p)->lock, then __task_rq_lock() will have to wait for
> that, and then observe the newly set p->srq_lock and cycle to take that.
For that to work, [__]task_rq_lock() would have to read p->srq_lock while
holding rq_lock. It's a simple reordering, but yeah, it'd help to have
setter/getter for explicit locking rules.
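
Roughly, as a sketch of that ordering only (hypothetical shape - the real
__task_rq_lock()'s lockdep pinning and the matching unlock path are
omitted):

	for (;;) {
		struct rq *rq = task_rq(p);
		raw_spinlock_t *slock;

		raw_spin_rq_lock(rq);
		/* sample ->srq_lock only after rq->lock is held */
		slock = READ_ONCE(p->srq_lock);
		if (slock)
			raw_spin_lock(slock);	/* DSQ lock nests inside rq->lock */
		if (likely(rq == task_rq(p) && slock == READ_ONCE(p->srq_lock)))
			return rq;		/* stable, both locks held */
		if (slock)
			raw_spin_unlock(slock);
		raw_spin_rq_unlock(rq);
	}

With rq->lock taken before the ->srq_lock read, an enqueue which sets
->srq_lock under task_rq(p)->lock can't be missed.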
...
> We must do set_task_cpu(0) before task_unlink_from_dsq() (and I got this
> order wrong in yesterday's email).
>
> pick_task_scx() on CPU0
>   lock DSQ
>   pick p
>   set_task_cpu(0)              task_rq_lock()
>   task_unlink_from_dsq()         if !p->srq_lock, then task_rq(p) == 0
>   p->srq_lock = NULL;
>   p is moved to local DSQ
>
> Perhaps the p->srq_lock store should be store-release, so that the cpu
> store is before.
>
> Then if we observe p->srq_lock, we'll serialize against DSQ and all is
> well, if we observe !p->srq_lock then we must also observe task_rq(p) ==
> 0 and then we'll serialize on rq->lock.
I see, so the interlocking is between task_rq(p) and p->srq_lock - a reader
sees either the updated CPU or a non-NULL ->srq_lock. As long as whoever
clears ->srq_lock holds both the destination rq and DSQ locks,
task_rq_lock() either ends up waiting on ->srq_lock or sees the updated CPU
and has to loop over and wait on the destination rq.
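
In code, the clearing side would then be something like the following
(sketch; dst_rq is my shorthand for the rq the task was just dispatched
to):

	lockdep_assert_held(&dsq->lock);

	set_task_cpu(p, cpu_of(dst_rq));
	task_unlink_from_dsq(p, dsq);
	/*
	 * Release ordering: the task_cpu() update above must be visible
	 * before ->srq_lock reads as NULL, so whoever sees NULL also sees
	 * the new CPU and falls back to serializing on dst_rq->lock.
	 */
	smp_store_release(&p->srq_lock, NULL);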
> Now let me see if there isn't an ABA issue here, consider:
>
> pre: task_cpu(p) != 2, p->srq_lock = NULL
>
> CPU0                            CPU1                            CPU2
>
> __task_rq_lock()                enqueue_task_scx()              pick_task_scx()
>
>                                 rq = task_rq(p);
>                                 LOCK rq->lock
> rq = task_rq(p)
> LOCK rq->lock
>   .. waits
>                                 LOCK dsq->lock
>                                 enqueue on dsq
>                                 p->srq_lock = &dsq->lock
>                                 UNLOCK dsq->lock
>                                                                 LOCK dsq->lock
>                                                                 pick p
>                                 UNLOCK rq->lock
>                                                                 set_task_cpu(2)
>                                                                 task_unlink_from_dsq()
>                                                                 p->srq_lock = NULL;
>                                                                 UNLOCK dsq->lock
>   .. resumes
>
> At this point our CPU0's __task_rq_lock():
>
> - if it observes p->srq_lock, it will cycle taking that, only to then
> find out p->srq_lock is no longer set, but then it must also see
> task_rq() has changed, so the next cycle will block on CPU2's
> rq->lock.
>
> - if it observes !p->srq_lock, then it cannot be the initial NULL,
> since the initial task_rq(p)->lock ordering prohibits this. So it
> must be the second NULL, which then also mandates we see the CPU
> change and we'll cycle to take CPU2's rq->lock.
>
> That is, I _think_ we're okay :-)
It *seems* that way to me. There are two other scenarios, though.
- A task can move from a non-local DSQ to another non-local DSQ at any time
  while queued. As this doesn't cause an rq migration, we can probably just
  overwrite p->srq_lock with the new one. Need to think about it a bit more.
- A task can be queued on a BPF data structure and thus may not be on any
  DSQ. I think this can be handled by adding a raw_spinlock to task_struct
  and treating the task as if it's on its own DSQ by pointing ->srq_lock at
  that lock, and grabbing it when transferring the task from the BPF side
  (see the sketch below).
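
For that second case, the shape would be something like this (purely
illustrative - ->srq_self_lock being the hypothetical raw_spinlock added to
task_struct):

	/* handing the task to a BPF data structure, under task_rq(p)->lock */
	WRITE_ONCE(p->srq_lock, &p->srq_self_lock);

	/* transferring the task from the BPF side back onto a DSQ */
	raw_spin_lock(&p->srq_self_lock);
	/* ... enqueue p on dsq, switch p->srq_lock to &dsq->lock ... */
	raw_spin_unlock(&p->srq_self_lock);

so that [__]task_rq_lock() always has something to serialize against even
while the task is off any DSQ.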
So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
doing something simpler and just adding a per-task lock which nests inside
the rq lock, is always grabbed by [__]task_rq_lock(), and is optionally
grabbed by sched classes that want to migrate tasks without grabbing the
source rq lock? That way, we don't need the lock pointer dancing while
achieving about the same result. From sched_ext's POV, grabbing that
per-task lock is likely going to be cheaper than the rq lock switching, so
it's way simpler and nothing gets worse.
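
Concretely, something like (hypothetical ->srq_task_lock field; rq pinning,
the unlock side and set_task_cpu()'s current lockdep expectations ignored):

	struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
	{
		struct rq *rq;

		for (;;) {
			rq = task_rq(p);
			raw_spin_rq_lock(rq);
			/* always taken, nests inside rq->lock */
			raw_spin_lock(&p->srq_task_lock);
			if (likely(rq == task_rq(p)))
				return rq;
			raw_spin_unlock(&p->srq_task_lock);
			raw_spin_rq_unlock(rq);
		}
	}

and a sched class migrating p without the source rq lock would just do:

	raw_spin_lock(&p->srq_task_lock);
	set_task_cpu(p, new_cpu);
	raw_spin_unlock(&p->srq_task_lock);

If the migrator wins the race, __task_rq_lock() blocks on ->srq_task_lock,
then sees task_rq(p) changed and loops over to the destination rq.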
Thanks.
--
tejun