Message-Id: <DG9BLVVE2TJQ.381PKRLX41AM2@etsalapatis.com>
Date: Sun, 08 Feb 2026 00:11:11 -0500
From: "Emil Tsalapatis" <emil@...alapatis.com>
To: "Andrea Righi" <arighi@...dia.com>
Cc: "Tejun Heo" <tj@...nel.org>, "David Vernet" <void@...ifault.com>,
"Changwoo Min" <changwoo@...lia.com>, "Kuba Piecuch" <jpiecuch@...gle.com>,
"Christian Loehle" <christian.loehle@....com>, "Daniel Hodges"
<hodgesd@...a.com>, <sched-ext@...ts.linux.dev>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate
ops.dequeue() semantics
On Sat Feb 7, 2026 at 4:16 AM EST, Andrea Righi wrote:
> Hi Emil,
>
Hi Andrea,
> On Fri, Feb 06, 2026 at 03:10:55PM -0500, Emil Tsalapatis wrote:
>> On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
>>
>> Hi Andrea,
>>
>> > Add a new kselftest to validate that the new ops.dequeue() semantics
>> > work correctly for all task lifecycle scenarios, including the
>> > distinction between terminal DSQs (where BPF scheduler is done with the
>> > task), user DSQs (where BPF scheduler manages the task lifecycle) and
>> > BPF data structures, regardless of which event performs the dispatch.
>> >
>> > The test validates the following scenarios:
>> >
>> > - From ops.select_cpu():
>> > - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
>> > the BPF scheduler entirely; they never enter BPF custody, so
>> > ops.dequeue() is not called,
>> > - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
>> > bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
>> > not called,
>> > - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
>> > enqueue/dequeue lifecycle tracking and state machine validation
>> > (expects 1:1 enqueue/dequeue pairing).
>>
>> Could you add a note here about why there's no equivalent to scenario 6?
>> The differentiating factor between that and scenario 2 (nonterminal queue) is
>> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
>> And this makes sense since for non-DSQ queues the BPF scheduler can do its
>> own tracking of enqueue/dequeue (plus it does not make too much sense to
>> do BPF-internal enqueueing in select_cpu).
>>
>> What do you think? If the above makes sense, maybe we should spell it out
>> in the documentation too. Maybe also add that it makes no sense to enqueue
>> into an internal BPF structure from select_cpu - the task is not yet
>> enqueued, and would have to go through enqueue anyway.
>
> Oh, I just didn't think about it, we can definitely add to ops.select_cpu()
> a scenario equivalent to scenario 6 (push task to the BPF queue).
>
> From a practical standpoint the benefits are questionable, but in the scope
> of the kselftest I think it makes sense to better validate the entire state
> machine in all cases. I'll add this scenario as well.
>
That makes sense! Let's add it for completeness. Even if it doesn't make
sense right now, that may change in the future. For example, if we end
up finding a good reason to add the task to an internal structure from
.select_cpu(), we may allow the task to be explicitly marked as being in
the BPF scheduler's custody from a kfunc. Right now we can't do that
from select_cpu() unless we direct dispatch, IIUC.
>>
>> >
>> > - From ops.enqueue():
>> > - scenario 3 (local DSQ): same behavior as scenario 0,
>> > - scenario 4 (global DSQ): same behavior as scenario 1,
>> > - scenario 5 (user DSQ): same behavior as scenario 2,
>> > - scenario 6 (BPF internal queue): tasks are stored in a BPF queue
>> > in ops.enqueue() and consumed in ops.dispatch(); they remain in
>> > BPF custody until dispatch, with full lifecycle tracking and 1:1
>> > enqueue/dequeue validation.
>> >
>> > This verifies that:
>> > - terminal DSQ dispatch (local, global) don't trigger ops.dequeue(),
>> > - user DSQ / internal BPF data structure dispatch has exact 1:1
>> > ops.enqueue()/dequeue() pairing,
>> > - dispatch dequeues have no flags (normal workflow),
>> > - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set,
>> > - no duplicate enqueues or invalid state transitions are happening,
>> > - ops.enqueue() and ops.select_cpu() dispatch paths behave identically.
>> >
>> > Cc: Tejun Heo <tj@...nel.org>
>> > Cc: Emil Tsalapatis <emil@...alapatis.com>
>> > Cc: Kuba Piecuch <jpiecuch@...gle.com>
>> > Signed-off-by: Andrea Righi <arighi@...dia.com>
>> > ---
>> > tools/testing/selftests/sched_ext/Makefile | 1 +
>> > .../testing/selftests/sched_ext/dequeue.bpf.c | 403 ++++++++++++++++++
>> > tools/testing/selftests/sched_ext/dequeue.c | 258 +++++++++++
>> > 3 files changed, 662 insertions(+)
>> > create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
>> > create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
>> >
>> > diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
>> > index 5fe45f9c5f8fd..764e91edabf93 100644
>> > --- a/tools/testing/selftests/sched_ext/Makefile
>> > +++ b/tools/testing/selftests/sched_ext/Makefile
>> > @@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
>> >
>> > auto-test-targets := \
>> > create_dsq \
>> > + dequeue \
>> > enq_last_no_enq_fails \
>> > ddsp_bogus_dsq_fail \
>> > ddsp_vtimelocal_fail \
>> > diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c
>> > new file mode 100644
>> > index 0000000000000..4ba657ba1bff5
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c
>> > @@ -0,0 +1,403 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/*
>> > + * A scheduler that validates ops.dequeue() is called correctly:
>> > + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
>> > + * scheduler entirely: no ops.dequeue() should be called
>> > + * - Tasks dispatched to user DSQs enter BPF custody: ops.dequeue() must be
>> > + * called when they leave custody
>> > + * - Every ops.enqueue() for non-terminal DSQs is followed by exactly one
>> > + * ops.dequeue() (validate 1:1 pairing and state machine)
>> > + *
>> > + * Copyright (c) 2026 NVIDIA Corporation.
>> > + */
>> > +
>> > +#include <scx/common.bpf.h>
>> > +
>> > +#define SHARED_DSQ 0
>> > +
>> > +/*
>> > + * Scenario 6: BPF internal queue. Tasks are stored here from ops.enqueue()
>> > + * and consumed from ops.dispatch(), validating that tasks not on a user DSQ
>> > + * (only on BPF internal structures) still get ops.dequeue() when they leave.
>> > + */
>> > +struct {
>> > + __uint(type, BPF_MAP_TYPE_QUEUE);
>> > + __uint(max_entries, 4096);
>>
>> Nit: Can we make this larger? I don't think there are any downsides. I know
>> there's a mitigation for when the queue gets full, please see the nit below.
>
> Sure, like 32768?
>
> Or we can keep it like this so we can potentially test also the fallback
> path sometimes (mixed BPF queue dispatches + built-in DSQ dispatches).
>
32K makes sense. If we keep the fallback, maybe we can also add a
WARN_ON_ONCE() equivalent that flags when it gets triggered, so we can
make sure we don't hit it every single time (e.g. because the BPF queue
is misbehaving)? Something like the sketch below.
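
Just a sketch of what I mean -- the fallback_cnt/fallback_warned names are
made up, and I'm assuming the fallback stays in the scenario 6 branch of
ops.enqueue() and keeps inserting into the global DSQ:

	/* Sketch only: placeholder names, counter exposed for userspace. */
	u64 fallback_cnt;
	static bool fallback_warned;

	/* Queue full: fall back, but make the fallback visible. */
	if (bpf_map_push_elem(&global_queue, &pid, 0)) {
		__sync_fetch_and_add(&fallback_cnt, 1);
		if (!fallback_warned) {
			fallback_warned = true;
			bpf_printk("dequeue: BPF queue full, using fallback DSQ");
		}
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL,
				   enq_flags);
	}
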
>>
>> > + __type(value, s32);
>> > +} global_queue SEC(".maps");
>> > +
>> > +char _license[] SEC("license") = "GPL";
>> > +
>> > +UEI_DEFINE(uei);
>> > +
>> > +/*
>> > + * Counters to track the lifecycle of tasks:
>> > + * - enqueue_cnt: Number of times ops.enqueue() was called
>> > + * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
>> > + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
>> > + * - change_dequeue_cnt: Number of property change dequeues
>> > + */
>> > +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt;
>> > +
>> > +/*
>> > + * Test scenarios (0-2: ops.select_cpu(), 3-6: ops.enqueue()):
>> > + * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
>> > + * dequeue callbacks expected)
>> > + * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
>> > + * dequeue callbacks expected)
>> > + * 6) BPF internal queue: store task PIDs in ops.enqueue(), consume in
>> > + * ops.dispatch() and dispatch to local DSQ (validates dequeue for tasks
>> > + * in BPF custody but not on a user DSQ)
>> > + */
>> > +u32 test_scenario;
>> > +
>> > +/*
>> > + * Per-task state to track lifecycle and validate workflow semantics.
>> > + * State transitions:
>> > + * NONE -> ENQUEUED (on enqueue)
>> > + * ENQUEUED -> DISPATCHED (on dispatch dequeue)
>> > + * DISPATCHED -> NONE (on property change dequeue or re-enqueue)
>> > + * ENQUEUED -> NONE (on property change dequeue before dispatch)
>> > + */
>> > +enum task_state {
>> > + TASK_NONE = 0,
>> > + TASK_ENQUEUED,
>> > + TASK_DISPATCHED,
>> > +};
>> > +
>> > +struct task_ctx {
>> > + enum task_state state; /* Current state in the workflow */
>> > + u64 enqueue_seq; /* Sequence number for debugging */
>> > +};
>> > +
>> > +struct {
>> > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
>> > + __uint(map_flags, BPF_F_NO_PREALLOC);
>> > + __type(key, int);
>> > + __type(value, struct task_ctx);
>> > +} task_ctx_stor SEC(".maps");
>> > +
>> > +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
>> > +{
>> > + return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
>> > +}
>> > +
>> > +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
>> > + s32 prev_cpu, u64 wake_flags)
>> > +{
>> > + struct task_ctx *tctx;
>> > +
>> > + tctx = try_lookup_task_ctx(p);
>> > + if (!tctx)
>> > + return prev_cpu;
>> > +
>> > + switch (test_scenario) {
>> > + case 0:
>> > + /*
>> > + * Scenario 0: Direct dispatch to local DSQ from select_cpu.
>> > + *
>> > + * Task bypasses BPF scheduler entirely: no enqueue
>> > + * tracking, no dequeue callbacks. Behavior should be
>> > + * identical to scenario 3.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
>> > + return prev_cpu;
>> > +
>> > + case 1:
>> > + /*
>> > + * Scenario 1: Direct dispatch to global DSQ from select_cpu.
>> > + *
>> > + * Like scenario 0, task bypasses BPF scheduler entirely.
>> > + * Behavior should be identical to scenario 4.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
>> > + return prev_cpu;
>> > +
>> > + case 2:
>> > + /*
>> > + * Scenario 2: Dispatch to shared user DSQ from select_cpu.
>> > + *
>> > + * Task enters BPF scheduler management: track
>> > + * enqueue/dequeue lifecycle and validate state transitions.
>> > + * Behavior should be identical to scenario 5.
>> > + */
>> > + __sync_fetch_and_add(&enqueue_cnt, 1);
>> > +
>> > + /*
>> > + * Validate state transition: enqueue is only valid from
>> > + * NONE or DISPATCHED states. Getting enqueue while in
>> > + * ENQUEUED state indicates a missing dequeue.
>> > + */
>> > + if (tctx->state == TASK_ENQUEUED)
>> > + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
>> > + p->pid, p->comm, tctx->enqueue_seq);
>> > +
>> > + /* Transition to ENQUEUED state */
>> > + tctx->state = TASK_ENQUEUED;
>> > + tctx->enqueue_seq++;
>> > +
>> > + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
>> > + return prev_cpu;
>> > +
>> > + default:
>> > + /*
>> > + * Force all tasks through ops.enqueue().
>> > + */
>> > + return prev_cpu;
>> > + }
>> > +}
>> > +
>> > +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
>> > +{
>> > + struct task_ctx *tctx;
>> > +
>> > + tctx = try_lookup_task_ctx(p);
>> > + if (!tctx)
>> > + return;
>> > +
>> > + switch (test_scenario) {
>> > + case 3:
>> > + /*
>> > + * Scenario 3: Direct dispatch to the local DSQ.
>> > + *
>> > + * Task bypasses BPF scheduler entirely: no enqueue
>> > + * tracking, no dequeue callbacks. Don't increment counters
>> > + * or validate state since the task never enters BPF
>> > + * scheduler management.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
>> > + break;
>> > +
>> > + case 4:
>> > + /*
>> > + * Scenario 4: Direct dispatch to the global DSQ.
>> > + *
>> > + * Like scenario 3, task bypasses BPF scheduler entirely.
>> > + * SCX_DSQ_GLOBAL is a terminal DSQ, tasks dispatched to it
>> > + * leave BPF custody immediately, so no dequeue callbacks
>> > + * should be triggered.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
>> > + break;
>> > +
>> > + case 5:
>> > + /*
>> > + * Scenario 5: Dispatch to shared user DSQ.
>> > + *
>> > + * Task enters BPF scheduler management: track
>> > + * enqueue/dequeue lifecycle and validate state
>> > + * transitions.
>> > + */
>> > + __sync_fetch_and_add(&enqueue_cnt, 1);
>> > +
>> > + /*
>> > + * Validate state transition: enqueue is only valid from
>> > + * NONE or DISPATCHED states. Getting enqueue while in
>> > + * ENQUEUED state indicates a missing dequeue (or stale state
>> > + * from a previous scenario when the scheduler was unregistered
>> > + * with tasks still on a DSQ). Reset and proceed to avoid false
>> > + * positives across scenario switches.
>> > + */
>> > + if (tctx->state == TASK_ENQUEUED)
>> > + tctx->state = TASK_NONE;
>> > +
>> > + /* Transition to ENQUEUED state */
>> > + tctx->state = TASK_ENQUEUED;
>> > + tctx->enqueue_seq++;
>> > +
>> > + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
>> > + break;
>> > +
>> > + case 6:
>> > + /*
>> > + * Scenario 6: Store task in BPF internal queue. Task enters
>> > + * BPF custody (kernel sets SCX_TASK_NEED_DEQ). When
>> > + * ops.dispatch() later pops and inserts to local DSQ,
>> > + * ops.dequeue() must be called.
>> > + *
>> > + * If the queue is full, fallback to local DSQ. The task still
>> > + * goes through QUEUED in the kernel and gets ops.dequeue()
>> > + * when moved to the terminal DSQ, so we track it the same.
>> > + *
>> > + * If state is already ENQUEUED (e.g. task was on a DSQ when
>> > + * the scheduler was unregistered in a previous scenario),
>> > + * reset to NONE and proceed to avoid false positives.
>> > + */
>> > + {
>> > + s32 pid = p->pid;
>> > +
>> > + if (tctx->state == TASK_ENQUEUED)
>> > + tctx->state = TASK_NONE;
>> > +
>> > + tctx->state = TASK_ENQUEUED;
>> > + tctx->enqueue_seq++;
>> > +
>> > + /* Queue full: fallback to the global DSQ */
>> Nit: Can we remove this fallback? It silently changes the behavior of
>> the test, and even though it makes sense to avoid overflowing the queue,
>> it causes the test to succeed even if for some reason the
>> bpf_map_push_elem() call fails. Why not just bump the queue size to a
>> reasonably large number instead?
>
> Hm... but if for any reason we overflow the queue we'd get a false positive
> error: task is ignored, we trigger a stall and it looks like something is
> wrong in ops.dequeue(). WDYT?
>
I agree, but if we bump the queue size to a large number the probability
of that is practically nonexistent: I think these tests make sense to run
in CI-like environments where there are few processes anyway, so if the
queue is large enough there will never be enough tasks to overflow it.
This is an assumption we make for dsp_local_on, too. Maybe we can keep
the fallback but warn when it's used (see above)?
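
On the userspace side the check could be as simple as the following --
again just a sketch, assuming the fallback_cnt counter from the sketch
above ends up in the skeleton's bss and this runs in the test's run()
callback after the workload:

	/* Warn if the fallback was ever hit; fail only if it always fires. */
	if (skel->bss->fallback_cnt) {
		fprintf(stderr, "warning: BPF queue fallback hit %llu times\n",
			(unsigned long long)skel->bss->fallback_cnt);
		if (skel->bss->fallback_cnt >= skel->bss->enqueue_cnt)
			return SCX_TEST_FAIL;
	}
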
> Thanks,
> -Andrea