Message-ID: <DG6C5HB3PHH3.2JRZX83QMLK2X@google.com>
Date: Wed, 04 Feb 2026 16:58:47 +0000
From: Kuba Piecuch <jpiecuch@...gle.com>
To: Andrea Righi <arighi@...dia.com>, Kuba Piecuch <jpiecuch@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>, Christian Loehle <christian.loehle@....com>,
Emil Tsalapatis <emil@...alapatis.com>, Daniel Hodges <hodgesd@...a.com>, <sched-ext@...ts.linux.dev>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes

On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
>> >
>> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
>> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
>> > re-dispatched using up-to-date affinity information.
>>
>> How will the scheduler know that the dispatch was dropped? Is the scheduler
>> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
>> on CPU1?
>
> The idea was that, if the dispatch is dropped, we'll see another
> ops.enqueue() for the task, so at least the task is not "lost" and the
> BPF scheduler gets another chance to decide what to do with it. In this case it'd be
> useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
> the enqueue resulted from a dropped dispatch.

I think SCX_ENQ_REENQ is enough for now; we can always add a dedicated flag
if the need arises.
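
FWIW, consuming that flag on the BPF side should be trivial. Something like
the below (untested sketch; pick_target_cpu() stands in for whatever
placement logic the scheduler already has, and SHARED_DSQ for its usual
fallback DSQ):

        void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
        {
                if (enq_flags & SCX_ENQ_REENQ) {
                        /*
                         * A previous dispatch of @p was dropped (e.g. stale
                         * affinity), so re-pick the target CPU based on the
                         * current p->cpus_ptr.
                         */
                        s32 cpu = pick_target_cpu(p);

                        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
                                           SCX_SLICE_DFL, enq_flags);
                        return;
                }

                scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
        }
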
I still worry about the scenario you described. In particular, I think it can
lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.

CPU0                                             CPU1
----                                             ----
if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                                 task_rq_lock(p)
                                                 dequeue_task_scx(p, ...)
                                                   (remove p from internal queues)
                                                 set_cpus_allowed_scx(p, new_mask)
                                                 enqueue_task_scx(p, ...)
                                                   (add p to internal queues)
                                                 task_rq_unlock(p)
(remove p from internal queues)
scx_bpf_dsq_insert(p,
                   SCX_DSQ_LOCAL_ON | cpu, 0)

In this scenario, the ops.enqueue() that's supposed to notify the BPF
scheduler about the failed dispatch happens _before_ the dispatch itself,
so once the dispatch fails, the task won't be re-enqueued.

There are two problems here:

1. CPU0 makes a scheduling decision based on stale data, and this isn't
   detected.
2. Even if it is detected and the dispatch is aborted, the task won't be
   re-enqueued.

The way we deal with the first problem in ghOSt (Google's equivalent of
sched_ext) is to expose the per-task sequence number to the BPF scheduler.
On the dispatch path, when the BPF scheduler has a candidate task,
it retrieves its seqnum, re-checks the task state to ensure that it's still
eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
for verification. If the kernel detects that the seqnum has already changed,
it synchronously fails the dispatch attempt (dispatch always happens
synchronously in ghOSt). In sched_ext, we could do the same synchronous
check, but we'd also need to repeat it later in finish_dispatch(), comparing
the current qseq against the qseq passed in by the BPF scheduler.
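
In sched_ext terms, the BPF half of that handshake could look roughly like
the below. To be clear, scx_bpf_task_qseq(), scx_bpf_dsq_insert_checked()
and the still_dispatchable()/requeue() helpers are all made-up names, just
to illustrate the shape:

        /* ops.dispatch() path, after picking candidate task @p and @cpu */
        u64 qseq = scx_bpf_task_qseq(p);        /* hypothetical kfunc */

        /* re-check the scheduler's own bookkeeping for @p */
        if (!still_dispatchable(p, cpu))
                return;

        /*
         * The kernel compares @qseq against the task's current value and
         * fails the insert synchronously if the task has been dequeued or
         * re-enqueued in the meantime.
         */
        if (scx_bpf_dsq_insert_checked(p, SCX_DSQ_LOCAL_ON | cpu,
                                       SCX_SLICE_DFL, 0, qseq) < 0) {
                /* stale decision: put @p back on one of our queues */
                requeue(p);
                return;
        }
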
To fix the second problem, we would need to explicitly call ops.enqueue()
from finish_dispatch() and the other places where we abort the dispatch
because the qseq is out of date.
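
Concretely, something along these lines (pseudo-code; I haven't checked what
state finish_dispatch() actually has in hand or which locks are held at that
point):

        /* finish_dispatch(), on detecting that the qseq went stale */
        if (qseq_at_dispatch != qseq_expected) {
                /* drop the dispatch and restore SCX_OPSS_QUEUED, as the
                 * patch already does ... */
                ...
                /* ... but also hand the task back to the BPF scheduler so
                 * it isn't forgotten */
                do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); /* or equivalent */
                return;
        }
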
Alternatively, we could just add locking to the BPF scheduler to prevent the
race from happening in the first place.

Thanks,
Kuba