Message-ID: <aVG9BBWIQLL8uJrQ@slm.duckdns.org>
Date: Sun, 28 Dec 2025 13:28:04 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
	Emil Tsalapatis <emil@...alapatis.com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

Hello,

On Sun, Dec 28, 2025 at 07:19:46AM -1000, Tejun Heo wrote:
> While this works, from the BPF sched's POV, there's no way to tell whether
> an ops.dequeue() call is from the task being actually dequeued or the
> follow-up to the dispatch operation it just did. This won't make much
> difference if ops.dequeue() is just used for accounting purposes, but a
> scheduler which uses an arena data structure for queueing would likely need
> to perform extra tests to tell whether the task needs to be dequeued from
> the arena side. I *think* the hot path (ops.dequeue() following the task's
> dispatch) can be a simple lockless test, so this may be okay, but from an
> API POV, it can probably be better.
> 
> The counterpart interlocking point is scx_bpf_dsq_insert(). If we can
> synchronize scx_bpf_dsq_insert() and dequeue so that ops.dequeue() is not
> called for a successfully inserted task, I think the semantics would be
> neater - an enqueued task is either dispatched or dequeued. Due to the async
> dispatch operation, this is likely difficult to do without adding extra sync
> operations in scx_bpf_dsq_insert(). However, I *think* we may be able to get
> rid of dspc and async inserting if we call ops.dispatch() w/ the rq lock
> dropped. That may make the whole dispatch path simpler and the behavior
> neater too. What do you think?
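
To make the first paragraph above concrete, the extra test might look
something like the following on the BPF side - task_ctx, its queued flag
and the helpers are all made-up names for illustration, not existing API:

/* sketch only - task_ctx layout and helpers are made up */
void BPF_STRUCT_OPS(sched_dequeue, struct task_struct *p, u64 deq_flags)
{
	struct task_ctx *tctx = lookup_task_ctx(p);

	if (!tctx)
		return;

	/*
	 * If the task was already popped from the arena queue by dispatch,
	 * this ops.dequeue() is just the follow-up to that dispatch and
	 * there's nothing to remove.
	 */
	if (!__sync_val_compare_and_swap(&tctx->queued, 1, 0))
		return;

	arena_queue_remove(tctx);
}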

I sat down and went through the code to see whether I was actually making
sense, and I wasn't:

The async dispatch buffering is necessary to avoid lock inversion between
the rq lock and whatever locks the BPF scheduler might be using internally.
The enqueue path runs with the rq lock held, so any lock that the BPF sched
uses in the enqueue path has to nest inside the rq lock.
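
For example, with a scheduler-internal lock (queue_lock and queue_push()
are illustrative names, e.g. a bpf_spin_lock in scheduler state):

/* ops.enqueue() is invoked with the task's rq lock already held */
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
	bpf_spin_lock(&queue_lock);	/* nests inside rq lock */
	queue_push(p);			/* scheduler-internal queueing */
	bpf_spin_unlock(&queue_lock);
}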

In dispatch, scx_bpf_dsq_insert() is likely to be called with the same
BPF-side lock held. If we tried to do the rq lock dancing synchronously, we
could end up trying to grab the rq lock while holding the BPF-side lock,
leading to deadlock.
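
IOW, the two paths would end up taking the two locks in opposite order
(queue_lock from the sketch above):

  enqueue path                        dispatch path (if insert were sync)
  ------------                        -----------------------------------
  lock rq->lock                       bpf_spin_lock(&queue_lock)
    bpf_spin_lock(&queue_lock)          lock rq->lock    <- inverted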

The kernel side has no control over BPF-side locking, so the asynchronous
operation is there to side-step the issue. I don't see a good way to make
this synchronous.
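
Very roughly, the buffering amounts to something like the following -
heavily simplified, and the struct/field names only approximate the actual
code:

/* heavily simplified sketch - when called from ops.dispatch(), the
 * insertion is only recorded in a per-cpu buffer (dspc)... */
__bpf_kfunc void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id,
				    u64 slice, u64 enq_flags)
{
	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);

	dspc->buf[dspc->cursor++] = (struct scx_dsp_buf_ent){
		.task = p, .dsq_id = dsq_id,
		.slice = slice, .enq_flags = enq_flags,
	};
}

/* ...and the buffer is flushed after ops.dispatch() returns, once any
 * BPF-side locks have been released and taking rq locks is safe again */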

So, please ignore that part. That's nonsense. I still wonder whether we can
create some interlocking between scx_bpf_dsq_insert() and ops.dequeue()
without making the hot path slower. I'll think more about it.
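
e.g., maybe something along these lines, on the kernel side - entirely
hypothetical, neither the flag nor the field exists:

/*
 * Entirely hypothetical sketch: scx_bpf_dsq_insert() sets a per-task
 * flag on successful insertion and the dequeue path consumes it, so
 * ops.dequeue() is skipped for tasks which were already dispatched.
 * The open question is whether this can be done without adding an
 * atomic to the hot path.
 */
if (!test_and_clear_bit(SCX_TASK_DISPATCHED, &p->scx.disp_flags))
	ops_dequeue(p, deq_flags);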

Thanks.

-- 
tejun
