Message-ID: <aYWpqDtzjci42qa0@gpd4>
Date: Fri, 6 Feb 2026 09:43:20 +0100
From: Andrea Righi <arighi@...dia.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
Christian Loehle <christian.loehle@....com>,
Emil Tsalapatis <emil@...alapatis.com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity
changes
On Thu, Feb 05, 2026 at 12:57:04PM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote:
> ...
> > > It shouldn't be returned, right? set_cpus_allowed() dequeues and
> > > re-enqueues. What the seq invalidation detected is dequeue racing the async
> > > dispatch and the invalidation means that the task was dequeued while on the
> > > async buffer (to be re-enqueued once the property change is complete). It
> > > should just be ignored.
> >
> > Yeah, the only downside is that the scheduler doesn't know that the task
> > has been re-enqueued due to a failed dispatch, but that's probably fine for
> > now.
>
> Yeah, but does that matter? Consider the following three scenarios:
>
> A. Task gets dispatched into local DSQ, CPU mask gets updated while in async
> buffer, the dispatch is ignored and then the task gets re-enqueued later.
>
> B. The same as A but the CPU mask update happens after the task lands in the
> local DSQ but before it starts executing.
>
> C. Task gets dispatched into local DSQ and starts running, CPU mask gets
> updated so that the task can't run on the current CPU anymore, migration
> task preempts the task and it gets enqueued.
>
> A and B would be indistinguishable from BPF sched's POV. C would be a bit
> different in that the task would transition through ops->running/stopping().
>
> I don't see anything significantly different across the three scenarios -
> the task was dispatched but cpumask got updated and the scheduler needs to
> place it again.
Ack, knowing whether an enqueue is coming from a failed dispatch or from a
regular running/stopping transition probably doesn't matter. And the
scheduler can probably try to infer this information, if needed.
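FWIW, if the enqueue path simply re-validates its target CPU against
p->cpus_ptr, all three scenarios collapse into the same handling anyway.
Something like this (untested sketch, made-up callback name, assuming the
usual scx BPF boilerplate):

void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
        s32 cpu = scx_bpf_task_cpu(p);

        /*
         * No need to know whether we're here because a previous dispatch
         * was dropped (A/B) or because of a regular running/stopping
         * transition (C): just re-validate the target CPU against the
         * current cpumask and pick another allowed CPU if it no longer
         * fits.
         */
        if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                cpu = bpf_cpumask_any_distribute(p->cpus_ptr);

        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, enq_flags);
}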
>
> ...
> > > Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue()
> > > synchronization and let things slide (probably optionally w/ an OPS flag),
> > > but for that, falling back to global DSQ is fine, no?
> >
> > I think the problem with the global DSQ fallback is that we're essentially
> > ignoring a request from the BPF scheduler to dispatch a task to a specific
> > CPU. Moreover, the global DSQ can potentially introduce starvation: if a
> > task is silently dispatched to the global DSQ and the BPF scheduler keeps
> > dispatching tasks to the local DSQs, the task waiting in the global DSQ
> > will never be consumed.
>
> While starvation is possible, it's not very likely:
>
> - ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs
> unless they're idle.
>
> - ops.dispatch() is only called after global DSQ is drained.
>
> If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other
> tasks waiting, it's gonna stall whether we fall back to global DSQ or not.
Well, if a scheduler is only using DDs with SCX_DSQ_LOCAL[_ON] and a
per-CPU task ends up in the global DSQ, the chance of starvation is not
that remote.
If we re-enqueue the task in the regular enqueue path, without the global
DSQ fallback, the task will be dispatched again by the BPF scheduler via
SCX_DSQ_LOCAL[_ON] and it's less likely to be starved.
I think if we just drop the dispatch, without the global DSQ fallback,
everything should work.
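To make the starvation concern more concrete, the pattern I have in mind is
something like this (toy example, not a real scheduler): all placement
happens via direct dispatch from ops.enqueue(), so a CPU whose local DSQ
keeps being refilled never runs dry, never gets to drain the global DSQ,
and a per-CPU task silently bounced there can wait for a very long time.

void BPF_STRUCT_OPS(dd_only_enqueue, struct task_struct *p, u64 enq_flags)
{
        /* always direct-dispatch to the task's current CPU */
        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p),
                           SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(dd_only_dispatch, s32 cpu, struct task_struct *prev)
{
        /*
         * Nothing to do: local DSQs are filled directly from ops.enqueue().
         * If this CPU's local DSQ never runs dry we never even get to the
         * point where the global DSQ is drained.
         */
}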
>
> But, taking a step back, the sloppy fallback behavior is secondary. What
> really matters is once we fix ops.dequeue(), can the BPF scheduler properly
> synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask
> or migration disabled state mismatches? If so, ops.dequeue() would be the
> primary way to deal with these issues.
Agreed, we should handle this in ops_dequeue().
>
> Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that
> indicates the scheduler isn't taking property changes into account at all,
> but that's really secondary. Let's first focus on making ops.dequeue()
> work properly so that the BPF scheduler can synchronize correctly.
Ack.
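Once ops.dequeue() is reliable, I'd expect the BPF side to be able to
synchronize with something along these lines (untested sketch, made-up
names, assuming the usual scx BPF boilerplate): keep a per-task queued/seq
state in task storage, clear it from ops.dequeue(), and have whatever async
path ends up calling scx_bpf_dsq_insert() skip entries whose state went
stale in the meantime.

struct task_ctx {
        u64     enq_seq;        /* bumped on every ops.enqueue() */
        bool    queued;         /* cleared by ops.dequeue() */
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct task_ctx);
} task_ctxs SEC(".maps");

void BPF_STRUCT_OPS(deq_aware_enqueue, struct task_struct *p, u64 enq_flags)
{
        struct task_ctx *tctx;

        tctx = bpf_task_storage_get(&task_ctxs, p, 0,
                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (!tctx) {
                scx_bpf_error("failed to allocate task_ctx");
                return;
        }

        tctx->enq_seq++;
        tctx->queued = true;

        /* ...stash @p and tctx->enq_seq for the async dispatch path... */
}

void BPF_STRUCT_OPS(deq_aware_dequeue, struct task_struct *p, u64 deq_flags)
{
        struct task_ctx *tctx;

        tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);
        if (tctx)
                tctx->queued = false;
}

The async path would then re-check tctx->queued and the stashed enq_seq
right before scx_bpf_dsq_insert() and drop the entry if either went stale.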
>
> ...
> > > I wonder whether we should define an invalid qseq and use that instead. The
> > > queueing instance really is invalid after this and it would help catch
> > > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> > > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> > > enqueued instances? While the symptom we most immediately see is through
> > > cpumask changes, the underlying problem is dequeue not shooting down
> > > existing enqueued tasks.
> >
> > I think I like the idea of having an INVALID_QSEQ or similar; it'd also
> > make debugging easier.
> >
> > I'm not sure about moving the logic to dequeue_task_scx(); more exactly,
> > I'm not sure if there are nasty locking implications. I'll do some
> > experiments; if it works, sure, dequeue would be a better place to cancel
> > invalid enqueued instances.
>
> I was confused while writing above. All of the above is already happening.
> When a task is dequeued, its OPSS is cleared and the task won't be eligible
> for dispatching anymore. The only "confused" case is where the task finishes
> reenqueueing before the previous dispatch attempt is finished, which the BPF
> scheduler should be able to handle once ops.dequeue() is fixed.
Yeah, I think we can handle this in the dequeue path. I think I have a new
working (maybe) patch. Will run some tests and send the new version later
today.
Thanks,
-Andrea