Message-ID: <aYOIRZosqGk-k3l-@gpd4>
Date: Wed, 4 Feb 2026 18:56:21 +0100
From: Andrea Righi <arighi@...dia.com>
To: Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Christian Loehle <christian.loehle@....com>,
Emil Tsalapatis <emil@...alapatis.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity
changes
On Wed, Feb 04, 2026 at 04:58:47PM +0000, Kuba Piecuch wrote:
> On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
> >> >
> >> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> >> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> >> > re-dispatched using up-to-date affinity information.
> >>
> >> How will the scheduler know that the dispatch was dropped? Is the scheduler
> >> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> >> on CPU1?
> >
> > The idea was that, if the dispatch is dropped, we'll see another
> > ops.enqueue() for the task, so at least the task is not "lost" and the
> > BPF scheduler gets another chance to decide what to do with it. In this case it'd be
> > useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
> > the enqueue resulted from a dropped dispatch.
>
> I think SCX_ENQ_REENQ is enough for now; we can always add a dedicated flag
> if a need for it arises.
>
> I still worry about the scenario you described. In particular, I think it can
> lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.
>
> CPU0                                        CPU1
> ----                                        ----
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                             task_rq_lock(p)
>                                             dequeue_task_scx(p, ...)
>                                               (remove p from internal queues)
>                                             set_cpus_allowed_scx(p, new_mask)
>                                             enqueue_task_scx(p, ...)
>                                               (add p to internal queues)
>                                             task_rq_unlock(p)
> (remove p from internal queues)
> scx_bpf_dsq_insert(p,
>                    SCX_DSQ_LOCAL_ON | cpu, 0)
>
> In this scenario, the ops.enqueue() which is supposed to notify the BPF
> scheduler about the failed dispatch actually happens _before_ the actual
> dispatch, so once the dispatch fails, the task won't be re-enqueued.
>
> There are two problems here:
>
> 1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
> 2. Even if it is detected and the dispatch aborted, the task won't be
> re-enqueued.
Right. At this point I think we can just rely on the affinity validation
via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable, and
simply drop invalid dispatches.

And to prevent dropped tasks from getting lost, I was wondering if we could
insert the task into a per-rq fallback DSQ that is then consumed from
balance_scx() to re-enqueue the task (setting SCX_ENQ_REENQ). This should
solve the re-enqueue problem while avoiding the locking complexity of
calling ops.enqueue() directly from finish_dispatch().
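
On the BPF side, handling the re-enqueue could then look roughly like this
(just a rough sketch, struct_ops boilerplate omitted; SHARED_DSQ stands for
whatever DSQ the scheduler normally queues into):

#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
        s32 cpu;

        if (enq_flags & SCX_ENQ_REENQ) {
                /*
                 * The previous dispatch was dropped (e.g. the affinity
                 * changed under us): pick a target CPU again using the
                 * now up-to-date p->cpus_ptr.
                 */
                cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
                if (cpu < 0)
                        cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
                if (cpu >= 0) {
                        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
                                           SCX_SLICE_DFL, enq_flags);
                        return;
                }
        }

        /* Normal path: queue into the scheduler's own shared DSQ. */
        scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
}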
Thoughts?
>
> The way we deal with the first problem in ghOSt (Google's equivalent of
> sched_ext) is we expose the per-task sequence number to the BPF scheduler.
> On the dispatch path, when the BPF scheduler has a candidate task,
> it retrieves its seqnum, re-checks the task state to ensure that it's still
> eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
> for verification. If the kernel detects that the seqnum has changed already,
> it synchronously fails the dispatch attempt (dispatch always happens
> synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
> we also need to do the same check later in finish_dispatch(), comparing
> the current qseq against the qseq passed by the BPF scheduler.
>
> To fix the second problem, we would need to explicitly call ops.enqueue()
> from finish_dispatch() and the other places where we abort dispatch if the
> qseq is out of date.
>
> Either that, or just add locking to the BPF scheduler to prevent the race from
> happening in the first place.
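
FWIW, the qseq validation in finish_dispatch() already boils down to
roughly this shape (helper name made up, but SCX_OPSS_* and
p->scx.ops_state are what is actually compared):

static bool qseq_still_valid(struct task_struct *p,
                             unsigned long qseq_at_dispatch)
{
        unsigned long opss = atomic_long_read(&p->scx.ops_state);

        /*
         * If the task was dequeued and re-enqueued after the BPF
         * scheduler sampled the qseq, the qseq embedded in ops_state no
         * longer matches and the dispatch has to be aborted.
         */
        return (opss & SCX_OPSS_STATE_MASK) == SCX_OPSS_QUEUED &&
               (opss & SCX_OPSS_QSEQ_MASK) == qseq_at_dispatch;
}
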
Thanks,
-Andrea