linux-kernel - Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aVFmstVjLW_QIQis@slm.duckdns.org>
Date: Sun, 28 Dec 2025 07:19:46 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hello, Andrea.

On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote:
...
> +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> +   is owned by the BPF scheduler. Ownership is retained until the task is
> +   either dispatched (moved to a local DSQ for execution) or dequeued
> +   (removed from the scheduler due to a blocking event, or to modify a
> +   property, like CPU affinity, priority, etc.). When the task leaves the
> +   BPF scheduler ``ops.dequeue()`` is invoked.
> +
> +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +   regardless of whether the task is still on a BPF data structure, or it
> +   is already dispatched to a DSQ (global, local, or user DSQ)
> +
> +   This guarantees that every ``ops.enqueue()`` will eventually be followed
> +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> +   track task ownership and maintain accurate accounting, such as per-DSQ
> +   queued runtime sums.

While this works, from the BPF sched's POV, there's no way to tell whether
an ops.dequeue() call is from the task being actually dequeued or the
follow-up to the dispatch operation it just did. This won't make much
difference if ops.dequeue() is just used for accounting purposes, but, a
scheduler which uses an arena data structure for queueing would likely need
to perform extra tests to tell whether the task needs to be dequeued from
the arena side. I *think* hot path (ops.dequeue() following the task's
dispatch) can be a simple lockless test, so this may be okay, but from API
POV, it can probably be better.

The counter interlocking point is scx_bpf_dsq_insert(). If we can
synchronize scx_bpf_dsq_insert() and dequeue so that ops.dequeue() is not
called for a successfully inserted task, I think the semantics would be
neater - an enqueued task is either dispatched or dequeued. Due to the async
dispatch operation, this likely is difficult to do without adding extra sync
operations in scx_bpf_dsq_insert(). However, I *think* we may be able to get
rid of dspc and async inserting if we call ops.dispatch() w/ rq lock
dropped. That may make the whole dispatch path simpler and the behavior
neater too. What do you think?

Thanks.

-- 
tejun