Message-ID: <aYOIRZosqGk-k3l-@gpd4>
Date: Wed, 4 Feb 2026 18:56:21 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Christian Loehle <christian.loehle@....com>,
Emil Tsalapatis <emil@...alapatis.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity
changes
On Wed, Feb 04, 2026 at 04:58:47PM +0000, Kuba Piecuch wrote:
> On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
> >> >
> >> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> >> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> >> > re-dispatched using up-to-date affinity information.
> >>
> >> How will the scheduler know that the dispatch was dropped? Is the scheduler
> >> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> >> on CPU1?
> >
> > The idea was that, if the dispatch is dropped, we'll see another
> > ops.enqueue() for the task, so at least the task is not "lost" and the
> > BPF scheduler gets another chance to decide what to do with it. In this case it'd be
> > useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
> > the enqueue resulted from a dropped dispatch.
>
> I think SCX_ENQ_REENQ is enough for now; we can always add a dedicated flag
> if a need for it arises.
>
> I still worry about the scenario you described. In particular, I think it can
> lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.
>
> CPU0                                        CPU1
> ----                                        ----
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                             task_rq_lock(p)
>                                             dequeue_task_scx(p, ...)
>                                               (remove p from internal queues)
>                                             set_cpus_allowed_scx(p, new_mask)
>                                             enqueue_task_scx(p, ...)
>                                               (add p to internal queues)
>                                             task_rq_unlock(p)
> (remove p from internal queues)
> scx_bpf_dsq_insert(p,
>                    SCX_DSQ_LOCAL_ON | cpu, 0)
>
> In this scenario, the ops.enqueue() which is supposed to notify the BPF
> scheduler about the failed dispatch actually happens _before_ the actual
> dispatch, so once the dispatch fails, the task won't be re-enqueued.
>
> There are two problems here:
>
> 1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
> 2. Even if it is detected and the dispatch aborted, the task won't be
> re-enqueued.
Right. At this point I think we can just rely on the affinity validation
via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable, and
simply drop invalid dispatches.

And to prevent dropped tasks from getting lost, I was wondering if we could
insert the task into a per-rq fallback DSQ that is then consumed from
balance_scx() to re-enqueue the task (setting SCX_ENQ_REENQ). This should
solve the re-enqueue problem while avoiding the locking complexity of
calling ops.enqueue() directly from finish_dispatch().
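
On the BPF side, handling the re-enqueue could then look roughly like this
(just a rough sketch, struct_ops boilerplate omitted; SHARED_DSQ stands for
whatever DSQ the scheduler normally queues into):

#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
        s32 cpu;

        if (enq_flags & SCX_ENQ_REENQ) {
                /*
                 * The previous dispatch was dropped (e.g. the affinity
                 * changed under us): pick a target CPU again using the
                 * now up-to-date p->cpus_ptr.
                 */
                cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
                if (cpu < 0)
                        cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
                if (cpu >= 0) {
                        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
                                           SCX_SLICE_DFL, enq_flags);
                        return;
                }
        }

        /* Normal path: queue into the scheduler's own shared DSQ. */
        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}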
Thoughts?
>
> The way we deal with the first problem in ghOSt (Google's equivalent of
> sched_ext) is we expose the per-task sequence number to the BPF scheduler.
> On the dispatch path, when the BPF scheduler has a candidate task,
> it retrieves its seqnum, re-checks the task state to ensure that it's still
> eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
> for verification. If the kernel detects that the seqnum has changed already,
> it synchronously fails the dispatch attempt (dispatch always happens
> synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
> we also need to do the same check later in finish_dispatch(), comparing
> the current qseq against the qseq passed by the BPF scheduler.
>
> To fix the second problem, we would need to explicitly call ops.enqueue()
> from finish_dispatch() and the other places where we abort dispatch if the
> qseq is out of date.
>
> Either that, or just add locking to the BPF scheduler to prevent the race from
> happening in the first place.
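
FWIW, the qseq validation in finish_dispatch() already boils down to
roughly this shape (helper name made up, but SCX_OPSS_* and
p->scx.ops_state are what is actually compared):

static bool qseq_still_valid(struct task_struct *p,
                             unsigned long qseq_at_dispatch)
{
        unsigned long opss = atomic_long_read(&p->scx.ops_state);

        /*
         * If the task was dequeued and re-enqueued after the BPF
         * scheduler sampled the qseq, the qseq embedded in ops_state no
         * longer matches and the dispatch has to be aborted.
         */
        return (opss & SCX_OPSS_STATE_MASK) == SCX_OPSS_QUEUED &&
               (opss & SCX_OPSS_QSEQ_MASK) == qseq_at_dispatch;
}
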
Thanks,
-Andrea