Message-ID: <aYUgQGvzJsoEaDZL@slm.duckdns.org>
Date: Thu, 5 Feb 2026 12:57:04 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
Christian Loehle <christian.loehle@....com>,
Emil Tsalapatis <emil@...alapatis.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity
changes
Hello,
On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote:
...
> > It shouldn't be returned, right? set_cpus_allowed() dequeues and
> > re-enqueues. What the seq invalidation detected is dequeue racing the async
> > dispatch and the invalidation means that the task was dequeued while on the
> > async buffer (to be re-enqueued once the property change is complete). It
> > should just be ignored.
>
> Yeah, the only downside is that the scheduler doesn't know that the task
> has been re-enqueued due to a failed dispatch, but that's probably fine for
> now.
Yeah, but does that matter? Consider the following three scenarios:
A. Task gets dispatched into local DSQ, CPU mask gets updated while in async
buffer, the dispatch is ignored and then the task gets re-enqueued later.
B. The same as A but the CPU mask update happens after the task lands in the
local DSQ but before it starts executing.
C. Task gets dispatched into local DSQ and starts running, CPU mask gets
updated so that the task can't run on the current CPU anymore, migration
task preempts the task and it gets enqueued.
A and B would be indistinguishable from the BPF scheduler's POV. C would be a bit
different in that the task would transition through ops->running/stopping().
I don't see anything significantly different across the three scenarios -
the task was dispatched but cpumask got updated and the scheduler needs to
place it again.
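
Concretely, all three end up surfacing as another ops.enqueue() call on the
BPF side, so a scheduler which simply re-places the task based on its current
cpumask handles them uniformly. A rough, untested sketch (callback name is
made up, assumes the usual scx common.bpf.h boilerplate):

#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
        s32 cpu = scx_bpf_task_cpu(p);

        /* the earlier placement may have been invalidated; re-check affinity */
        if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

        if (cpu >= 0) {
                scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                                   enq_flags);
                scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
        } else {
                scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL,
                                   enq_flags);
        }
}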
...
> > Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue()
> > synchronization and let things slide (probably optionally w/ an OPS flag),
> > but for that, falling back to global DSQ is fine, no?
>
> I think the problem with the global DSQ fallback is that we're essentially
> ignoring a request from the BPF scheduler to dispatch a task to a specific
> CPU. Moreover, the global DSQ can potentially introduce starvation: if a
> task is silently dispatched to the global DSQ and the BPF scheduler keeps
> dispatching tasks to the local DSQs, the task waiting in the global DSQ
> will never be consumed.
While starvation is possible, it's not very likely:
- ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs
unless they're idle.
- ops.dispatch() is only called after global DSQ is drained.
If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other
tasks waiting, it's gonna stall whether we fall back to global DSQ or not.
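
FWIW, the usual pattern, along the lines of scx_simple (again an untested
sketch), only direct dispatches when the picked CPU is actually idle:

s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p, s32 prev_cpu,
                   u64 wake_flags)
{
        bool is_idle = false;
        s32 cpu;

        cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
        /* direct-dispatch to the local DSQ only when the CPU is idle */
        if (is_idle)
                scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

        return cpu;
}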
But, taking a step back, the sloppy fallback behavior is secondary. What
really matters is once we fix ops.dequeue(), can the BPF scheduler properly
synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask
or migration disabled state mismatches? If so, ops.dequeue() would be the
primary way to deal with these issues.
Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that
indicates the scheduler isn't taking property changes into account at all,
but that's really secondary. Let's first focus on making ops.dequeue()
work properly so that the BPF scheduler can synchronize correctly.
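
To illustrate what that synchronization could look like (purely a sketch,
untested, same boilerplate as above, map and field names made up): the
scheduler keeps a per-task generation counter in task local storage, bumps it
from ops.dequeue(), and drops any deferred dispatch decision whose captured
generation no longer matches before calling scx_bpf_dsq_insert():

struct task_ctx {
        u64 enq_gen;            /* bumped on every dequeue */
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct task_ctx);
} task_ctxs SEC(".maps");

void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
{
        struct task_ctx *tctx;

        tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);
        if (tctx)
                __sync_fetch_and_add(&tctx->enq_gen, 1);
}

/* before committing a dispatch decision captured earlier */
static bool decision_still_valid(struct task_struct *p, u64 saved_gen)
{
        struct task_ctx *tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);

        return tctx && tctx->enq_gen == saved_gen;
}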
...
> > I wonder whether we should define an invalid qseq and use that instead. The
> > queueing instance really is invalid after this and it would help catching
> > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> > enqueued instances? While the symptom we most immediately see are through
> > cpumask changes, the underlying problem is dequeue not shooting down
> > existing enqueued tasks.
>
> I think I like the idea of having an INVALID_QSEQ or similar, it'd also
> make debugging easier.
>
> I'm not sure about moving the logic to dequeue_task_scx(), more exactly,
> I'm not sure if there're nasty locking implications. I'll do some
> experiments, if it works, sure, dequeue would be a better place to cancel
> invalid enqueued instances.
I was confused while writing above. All of the above is already happening.
When a task is dequeued, its OPSS is cleared and the task won't be eligible
for dispatching anymore. The only "confused" case is where the task finishes
re-enqueueing before the previous dispatch attempt is finished, which the BPF
scheduler should be able to handle once ops.dequeue() is fixed.
Thanks.
--
tejun