Message-ID: <aYcC_aKj_tO62rXZ@gpd4>
Date: Sat, 7 Feb 2026 10:16:45 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Changwoo Min <changwoo@...lia.com>,
Kuba Piecuch <jpiecuch@...gle.com>,
Christian Loehle <christian.loehle@....com>,
Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate
ops.dequeue() semantics
Hi Emil,
On Fri, Feb 06, 2026 at 03:10:55PM -0500, Emil Tsalapatis wrote:
> On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
>
> Hi Andrea,
>
> > Add a new kselftest to validate that the new ops.dequeue() semantics
> > work correctly for all task lifecycle scenarios, including the
> > distinction between terminal DSQs (where BPF scheduler is done with the
> > task), user DSQs (where BPF scheduler manages the task lifecycle) and
> > BPF data structures, regardless of which event performs the dispatch.
> >
> > The test validates the following scenarios:
> >
> > - From ops.select_cpu():
> > - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
> > the BPF scheduler entirely; they never enter BPF custody, so
> > ops.dequeue() is not called,
> > - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
> > bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
> > not called,
> > - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
> > enqueue/dequeue lifecycle tracking and state machine validation
> > (expects 1:1 enqueue/dequeue pairing).
>
> Could you add a note here about why there's no equivalent to scenario 6?
> The differentiating factor between that and scenario 2 (nonterminal queue) is
> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
> And this makes sense since for non-DSQ queues the BPF scheduler can do its
> own tracking of enqueue/dequeue (plus it does not make too much sense to
> do BPF-internal enqueueing in select_cpu).
>
> What do you think? If the above makes sense, maybe we should spell it out
> in the documentation too. Maybe also add that it makes no sense to enqueue
> into an internal BPF structure from select_cpu - the task is not yet
> enqueued, and would have to go through enqueue anyway.
Oh, I just didn't think about it; we can definitely add to ops.select_cpu()
a scenario equivalent to scenario 6 (push the task to the BPF queue).
From a practical standpoint the benefits are questionable, but in the scope
of the kselftest I think it makes sense to better validate the entire state
machine in all cases. I'll add this scenario as well.
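Just to sketch what I have in mind (untested; the scenario number and the
queue-full fallback are placeholders, and as you note, without a direct
dispatch the task would still flow through ops.enqueue()), the extra case
in dequeue_select_cpu() could look roughly like:

```c
	case 7:
		/*
		 * Hypothetical scenario: push the task to the BPF
		 * internal queue from ops.select_cpu(), mirroring the
		 * bookkeeping done by scenario 6 in ops.enqueue().
		 */
		{
			s32 pid = p->pid;

			if (tctx->state == TASK_ENQUEUED)
				tctx->state = TASK_NONE;
			tctx->state = TASK_ENQUEUED;
			tctx->enqueue_seq++;

			__sync_fetch_and_add(&enqueue_cnt, 1);

			/* Queue full: fall back to the global DSQ */
			if (bpf_map_push_elem(&global_queue, &pid, 0))
				scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL,
						   SCX_SLICE_DFL, 0);
		}
		return prev_cpu;
```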
>
> >
> > - From ops.enqueue():
> > - scenario 3 (local DSQ): same behavior as scenario 0,
> > - scenario 4 (global DSQ): same behavior as scenario 1,
> > - scenario 5 (user DSQ): same behavior as scenario 2,
> > - scenario 6 (BPF internal queue): tasks are stored in a BPF queue
> > in ops.enqueue() and consumed in ops.dispatch(); they remain in
> > BPF custody until dispatch, with full lifecycle tracking and 1:1
> > enqueue/dequeue validation.
> >
> > This verifies that:
> > - terminal DSQ dispatches (local, global) don't trigger ops.dequeue(),
> > - user DSQ / internal BPF data structure dispatch has exact 1:1
> > ops.enqueue()/dequeue() pairing,
> > - dispatch dequeues have no flags (normal workflow),
> > - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set,
> > - no duplicate enqueues or invalid state transitions are happening,
> > - ops.enqueue() and ops.select_cpu() dispatch paths behave identically.
> >
> > Cc: Tejun Heo <tj@...nel.org>
> > Cc: Emil Tsalapatis <emil@...alapatis.com>
> > Cc: Kuba Piecuch <jpiecuch@...gle.com>
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
> > tools/testing/selftests/sched_ext/Makefile | 1 +
> > .../testing/selftests/sched_ext/dequeue.bpf.c | 403 ++++++++++++++++++
> > tools/testing/selftests/sched_ext/dequeue.c | 258 +++++++++++
> > 3 files changed, 662 insertions(+)
> > create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
> > create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
> >
> > diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
> > index 5fe45f9c5f8fd..764e91edabf93 100644
> > --- a/tools/testing/selftests/sched_ext/Makefile
> > +++ b/tools/testing/selftests/sched_ext/Makefile
> > @@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
> >
> > auto-test-targets := \
> > create_dsq \
> > + dequeue \
> > enq_last_no_enq_fails \
> > ddsp_bogus_dsq_fail \
> > ddsp_vtimelocal_fail \
> > diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c
> > new file mode 100644
> > index 0000000000000..4ba657ba1bff5
> > --- /dev/null
> > +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c
> > @@ -0,0 +1,403 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * A scheduler that validates ops.dequeue() is called correctly:
> > + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
> > + * scheduler entirely: no ops.dequeue() should be called
> > + * - Tasks dispatched to user DSQs enter BPF custody: ops.dequeue() must be
> > + * called when they leave custody
> > + * - Every ops.enqueue() for non-terminal DSQs is followed by exactly one
> > + * ops.dequeue() (validate 1:1 pairing and state machine)
> > + *
> > + * Copyright (c) 2026 NVIDIA Corporation.
> > + */
> > +
> > +#include <scx/common.bpf.h>
> > +
> > +#define SHARED_DSQ 0
> > +
> > +/*
> > + * Scenario 6: BPF internal queue. Tasks are stored here from ops.enqueue()
> > + * and consumed from ops.dispatch(), validating that tasks not on a user DSQ
> > + * (only on BPF internal structures) still get ops.dequeue() when they leave.
> > + */
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_QUEUE);
> > + __uint(max_entries, 4096);
>
> Nit: Can we make this larger? I don't think there's any downsides. I know
> there's a mitigation for if the queue gets full, please see nit below.
Sure, like 32768?
Or we can keep it as is, so that we sometimes also exercise the fallback
path (mixed BPF queue dispatches + built-in DSQ dispatches).
>
> > + __type(value, s32);
> > +} global_queue SEC(".maps");
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +UEI_DEFINE(uei);
> > +
> > +/*
> > + * Counters to track the lifecycle of tasks:
> > + * - enqueue_cnt: Number of times ops.enqueue() was called
> > + * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
> > + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
> > + * - change_dequeue_cnt: Number of property change dequeues
> > + */
> > +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt;
> > +
> > +/*
> > + * Test scenarios (0-2: ops.select_cpu(), 3-6: ops.enqueue()):
> > + * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
> > + * scheduler, no dequeue callbacks)
> > + * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
> > + * scheduler, no dequeue callbacks)
> > + * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
> > + * dequeue callbacks expected)
> > + * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
> > + * scheduler, no dequeue callbacks)
> > + * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
> > + * scheduler, no dequeue callbacks)
> > + * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
> > + * dequeue callbacks expected)
> > + * 6) BPF internal queue: store task PIDs in ops.enqueue(), consume in
> > + * ops.dispatch() and dispatch to local DSQ (validates dequeue for tasks
> > + * in BPF custody but not on a user DSQ)
> > + */
> > +u32 test_scenario;
> > +
> > +/*
> > + * Per-task state to track lifecycle and validate workflow semantics.
> > + * State transitions:
> > + * NONE -> ENQUEUED (on enqueue)
> > + * ENQUEUED -> DISPATCHED (on dispatch dequeue)
> > + * DISPATCHED -> NONE (on property change dequeue or re-enqueue)
> > + * ENQUEUED -> NONE (on property change dequeue before dispatch)
> > + */
> > +enum task_state {
> > + TASK_NONE = 0,
> > + TASK_ENQUEUED,
> > + TASK_DISPATCHED,
> > +};
> > +
> > +struct task_ctx {
> > + enum task_state state; /* Current state in the workflow */
> > + u64 enqueue_seq; /* Sequence number for debugging */
> > +};
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > + __uint(map_flags, BPF_F_NO_PREALLOC);
> > + __type(key, int);
> > + __type(value, struct task_ctx);
> > +} task_ctx_stor SEC(".maps");
> > +
> > +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
> > +{
> > + return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
> > +}
> > +
> > +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
> > + s32 prev_cpu, u64 wake_flags)
> > +{
> > + struct task_ctx *tctx;
> > +
> > + tctx = try_lookup_task_ctx(p);
> > + if (!tctx)
> > + return prev_cpu;
> > +
> > + switch (test_scenario) {
> > + case 0:
> > + /*
> > + * Scenario 0: Direct dispatch to local DSQ from select_cpu.
> > + *
> > + * Task bypasses BPF scheduler entirely: no enqueue
> > + * tracking, no dequeue callbacks. Behavior should be
> > + * identical to scenario 3.
> > + */
> > + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> > + return prev_cpu;
> > +
> > + case 1:
> > + /*
> > + * Scenario 1: Direct dispatch to global DSQ from select_cpu.
> > + *
> > + * Like scenario 0, task bypasses BPF scheduler entirely.
> > + * Behavior should be identical to scenario 4.
> > + */
> > + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
> > + return prev_cpu;
> > +
> > + case 2:
> > + /*
> > + * Scenario 2: Dispatch to shared user DSQ from select_cpu.
> > + *
> > + * Task enters BPF scheduler management: track
> > + * enqueue/dequeue lifecycle and validate state transitions.
> > + * Behavior should be identical to scenario 5.
> > + */
> > + __sync_fetch_and_add(&enqueue_cnt, 1);
> > +
> > + /*
> > + * Validate state transition: enqueue is only valid from
> > + * NONE or DISPATCHED states. Getting enqueue while in
> > + * ENQUEUED state indicates a missing dequeue.
> > + */
> > + if (tctx->state == TASK_ENQUEUED)
> > + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
> > + p->pid, p->comm, tctx->enqueue_seq);
> > +
> > + /* Transition to ENQUEUED state */
> > + tctx->state = TASK_ENQUEUED;
> > + tctx->enqueue_seq++;
> > +
> > + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
> > + return prev_cpu;
> > +
> > + default:
> > + /*
> > + * Force all tasks through ops.enqueue().
> > + */
> > + return prev_cpu;
> > + }
> > +}
> > +
> > +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
> > +{
> > + struct task_ctx *tctx;
> > +
> > + tctx = try_lookup_task_ctx(p);
> > + if (!tctx)
> > + return;
> > +
> > + switch (test_scenario) {
> > + case 3:
> > + /*
> > + * Scenario 3: Direct dispatch to the local DSQ.
> > + *
> > + * Task bypasses BPF scheduler entirely: no enqueue
> > + * tracking, no dequeue callbacks. Don't increment counters
> > + * or validate state since the task never enters BPF
> > + * scheduler management.
> > + */
> > + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
> > + break;
> > +
> > + case 4:
> > + /*
> > + * Scenario 4: Direct dispatch to the global DSQ.
> > + *
> > + * Like scenario 3, task bypasses BPF scheduler entirely.
> > + * SCX_DSQ_GLOBAL is a terminal DSQ, tasks dispatched to it
> > + * leave BPF custody immediately, so no dequeue callbacks
> > + * should be triggered.
> > + */
> > + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
> > + break;
> > +
> > + case 5:
> > + /*
> > + * Scenario 5: Dispatch to shared user DSQ.
> > + *
> > + * Task enters BPF scheduler management: track
> > + * enqueue/dequeue lifecycle and validate state
> > + * transitions.
> > + */
> > + __sync_fetch_and_add(&enqueue_cnt, 1);
> > +
> > + /*
> > + * Validate state transition: enqueue is only valid from
> > + * NONE or DISPATCHED states. Getting enqueue while in
> > + * ENQUEUED state indicates a missing dequeue (or stale state
> > + * from a previous scenario when the scheduler was unregistered
> > + * with tasks still on a DSQ). Reset and proceed to avoid false
> > + * positives across scenario switches.
> > + */
> > + if (tctx->state == TASK_ENQUEUED)
> > + tctx->state = TASK_NONE;
> > +
> > + /* Transition to ENQUEUED state */
> > + tctx->state = TASK_ENQUEUED;
> > + tctx->enqueue_seq++;
> > +
> > + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
> > + break;
> > +
> > + case 6:
> > + /*
> > + * Scenario 6: Store task in BPF internal queue. Task enters
> > + * BPF custody (kernel sets SCX_TASK_NEED_DEQ). When
> > + * ops.dispatch() later pops and inserts to local DSQ,
> > + * ops.dequeue() must be called.
> > + *
> > + * If the queue is full, fallback to local DSQ. The task still
> > + * goes through QUEUED in the kernel and gets ops.dequeue()
> > + * when moved to the terminal DSQ, so we track it the same.
> > + *
> > + * If state is already ENQUEUED (e.g. task was on a DSQ when
> > + * the scheduler was unregistered in a previous scenario),
> > + * reset to NONE and proceed to avoid false positives.
> > + */
> > + {
> > + s32 pid = p->pid;
> > +
> > + if (tctx->state == TASK_ENQUEUED)
> > + tctx->state = TASK_NONE;
> > +
> > + tctx->state = TASK_ENQUEUED;
> > + tctx->enqueue_seq++;
> > +
> > + /* Queue full: fallback to the global DSQ */
> Nit: Can we remove this fallback? This silently changes the behavior of
> the test, and even though it makes sense to avoid overflowing the queue,
> it causes the test to succeed even if for some reason
> bpf_map_push_elem() fails. Why not just bump the queue size to a
> reasonably large number instead?
Hm... but if for any reason we overflow the queue we'd get a false
positive: the task is ignored, we trigger a stall, and it looks like
something is wrong in ops.dequeue(). WDYT?
Thanks,
-Andrea