Message-ID: <aYhRKbkHf-or9lNO@gpd4>
Date: Sun, 8 Feb 2026 10:02:33 +0100
From: Andrea Righi <arighi@...dia.com>
To: Emil Tsalapatis <emil@...alapatis.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Kuba Piecuch <jpiecuch@...gle.com>,
	Christian Loehle <christian.loehle@....com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate
 ops.dequeue() semantics

On Sun, Feb 08, 2026 at 12:11:11AM -0500, Emil Tsalapatis wrote:
> On Sat Feb 7, 2026 at 4:16 AM EST, Andrea Righi wrote:
> > Hi Emil,
> >
> 
> Hi Andrea,
> 
> > On Fri, Feb 06, 2026 at 03:10:55PM -0500, Emil Tsalapatis wrote:
> >> On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
> >> 
> >> Hi Andrea,
> >> 
> >> > Add a new kselftest to validate that the new ops.dequeue() semantics
> >> > work correctly for all task lifecycle scenarios, including the
> >> > distinction between terminal DSQs (where BPF scheduler is done with the
> >> > task), user DSQs (where BPF scheduler manages the task lifecycle) and
> >> > BPF data structures, regardless of which event performs the dispatch.
> >> >
> >> > The test validates the following scenarios:
> >> >
> >> >  - From ops.select_cpu():
> >> >      - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
> >> >        the BPF scheduler entirely; they never enter BPF custody, so
> >> >        ops.dequeue() is not called,
> >> >      - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
> >> >        bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
> >> >        not called,
> >> >      - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
> >> >        enqueue/dequeue lifecycle tracking and state machine validation
> >> >        (expects 1:1 enqueue/dequeue pairing).
> >> 
> >> Could you add a note here about why there's no equivalent to scenario 6?
> >> The differentiating factor between that and scenario 2 (nonterminal queue) is 
> >> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
> >> And this makes sense since for non-DSQ queues the BPF scheduler can do its
> >> own tracking of enqueue/dequeue (plus it does not make too much sense to
> >> do BPF-internal enqueueing in select_cpu).
> >> 
> >> What do you think? If the above makes sense, maybe we should spell it out 
> >> in the documentation too. Maybe also add that it makes no sense to enqueue
> >> in an internal BPF structure from select_cpu - the task is not yet
> >> enqueued, and would have to go through enqueue anyway.
> >
> > Oh, I just didn't think about it; we can definitely add to ops.select_cpu()
> > a scenario equivalent to scenario 6 (push the task to the BPF queue).
> >
> > From a practical standpoint the benefits are questionable, but in the scope
> > of the kselftest I think it makes sense to better validate the entire state
> > machine in all cases. I'll add this scenario as well.
> >
> 
> That makes sense! Let's add it for completeness. Even if it doesn't make
> sense right now, that may change in the future. For example, if we end
> up finding a good reason to add the task into an internal structure from
> .select_cpu(), we may allow the task to be explicitly marked as being in
> the BPF scheduler's custody from a kfunc. Right now we can't do that
> from select_cpu() unless we direct dispatch, IIUC.

Ok, I'll send a new patch later with the new scenario included. It should
already work (if done properly in the test case); I don't think we need to
change anything in the kernel.
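
Something along these lines in dequeue_select_cpu() should do (rough,
untested sketch; the scenario number and the fallback to the global DSQ
are arbitrary here):

	case 7: {
		/*
		 * Scenario 7 (sketch): store the task in the BPF internal
		 * queue directly from ops.select_cpu(), mirroring scenario 6.
		 * The task enters BPF custody here and is consumed later by
		 * ops.dispatch(), so ops.dequeue() must be called when it
		 * leaves.
		 */
		s32 pid = p->pid;

		if (tctx->state == TASK_ENQUEUED)
			tctx->state = TASK_NONE;

		tctx->state = TASK_ENQUEUED;
		tctx->enqueue_seq++;
		__sync_fetch_and_add(&enqueue_cnt, 1);

		/* Queue full: fall back to a terminal DSQ */
		if (bpf_map_push_elem(&global_queue, &pid, 0))
			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
		return prev_cpu;
	}

The fallback target is arbitrary, it just needs to be a terminal DSQ so the
task doesn't get lost if the push fails.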

We may need
https://lore.kernel.org/all/20260203230639.1259869-1-arighi@nvidia.com to
avoid hitting potential affinity issues, but that's a separate thing.

I was also considering including this in the series (for which I have a v2),
but it's not strictly related to ops.dequeue() (it just depends on this
ops.dequeue() change), so I think it makes sense to fix it later as a
separate patch.

> 
> >> 
> >> >
> >> >    - From ops.enqueue():
> >> >      - scenario 3 (local DSQ): same behavior as scenario 0,
> >> >      - scenario 4 (global DSQ): same behavior as scenario 1,
> >> >      - scenario 5 (user DSQ): same behavior as scenario 2,
> >> >      - scenario 6 (BPF internal queue): tasks are stored in a BPF queue
> >> >        in ops.enqueue() and consumed in ops.dispatch(); they remain in
> >> >        BPF custody until dispatch, with full lifecycle tracking and 1:1
> >> >        enqueue/dequeue validation.
> >> >
> >> > This verifies that:
> >> >  - terminal DSQ dispatch (local, global) don't trigger ops.dequeue(),
> >> >  - user DSQ / internal BPF data structure dispatch has exact 1:1
> >> >    ops.enqueue()/dequeue() pairing,
> >> >  - dispatch dequeues have no flags (normal workflow),
> >> >  - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set,
> >> >  - no duplicate enqueues or invalid state transitions are happening,
> >> >  - ops.enqueue() and ops.select_cpu() dispatch paths behave identically.
> >> >
> >> > Cc: Tejun Heo <tj@...nel.org>
> >> > Cc: Emil Tsalapatis <emil@...alapatis.com>
> >> > Cc: Kuba Piecuch <jpiecuch@...gle.com>
> >> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> >> > ---
> >> >  tools/testing/selftests/sched_ext/Makefile    |   1 +
> >> >  .../testing/selftests/sched_ext/dequeue.bpf.c | 403 ++++++++++++++++++
> >> >  tools/testing/selftests/sched_ext/dequeue.c   | 258 +++++++++++
> >> >  3 files changed, 662 insertions(+)
> >> >  create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
> >> >  create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
> >> >
> >> > diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
> >> > index 5fe45f9c5f8fd..764e91edabf93 100644
> >> > --- a/tools/testing/selftests/sched_ext/Makefile
> >> > +++ b/tools/testing/selftests/sched_ext/Makefile
> >> > @@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
> >> >  
> >> >  auto-test-targets :=			\
> >> >  	create_dsq			\
> >> > +	dequeue				\
> >> >  	enq_last_no_enq_fails		\
> >> >  	ddsp_bogus_dsq_fail		\
> >> >  	ddsp_vtimelocal_fail		\
> >> > diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c
> >> > new file mode 100644
> >> > index 0000000000000..4ba657ba1bff5
> >> > --- /dev/null
> >> > +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c
> >> > @@ -0,0 +1,403 @@
> >> > +// SPDX-License-Identifier: GPL-2.0
> >> > +/*
> >> > + * A scheduler that validates ops.dequeue() is called correctly:
> >> > + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
> >> > + *   scheduler entirely: no ops.dequeue() should be called
> >> > + * - Tasks dispatched to user DSQs enter BPF custody: ops.dequeue() must be
> >> > + *   called when they leave custody
> >> > + * - Every ops.enqueue() for non-terminal DSQs is followed by exactly one
> >> > + *   ops.dequeue() (validate 1:1 pairing and state machine)
> >> > + *
> >> > + * Copyright (c) 2026 NVIDIA Corporation.
> >> > + */
> >> > +
> >> > +#include <scx/common.bpf.h>
> >> > +
> >> > +#define SHARED_DSQ	0
> >> > +
> >> > +/*
> >> > + * Scenario 6: BPF internal queue. Tasks are stored here from ops.enqueue()
> >> > + * and consumed from ops.dispatch(), validating that tasks not on a user DSQ
> >> > + * (only on BPF internal structures) still get ops.dequeue() when they leave.
> >> > + */
> >> > +struct {
> >> > +	__uint(type, BPF_MAP_TYPE_QUEUE);
> >> > +	__uint(max_entries, 4096);
> >> 
> >> Nit: Can we make this larger? I don't think there's any downsides. I know
> >> there's a mitigation for if the queue gets full, please see nit below.
> >
> > Sure, like 32768?
> >
> > Or we can keep it like this, so we can occasionally also exercise the
> > fallback path (mixed BPF queue dispatches + built-in DSQ dispatches).
> >
> 
> 32K makes sense. If we keep the fallback, maybe we can just add a
> WARN_ON_ONCE() equivalent when it is triggered, so that we make
> sure we don't hit it every single time (e.g. because the BPF queue
> is misbehaving)?

Maybe we can add a bpf_queue_full_cnt counter, similar to the other task
lifecycle counters, and print it to stdout as part of the test output.

This way the fallback won't be considered a critical failure, and the metric
will still show up in the output if we ever need this information.
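
Something like this (just a sketch; bpf_queue_full_cnt and the skeleton
field names are placeholders), in the BPF part:

	u64 bpf_queue_full_cnt;

	...
		/* Queue full: account for it and fall back to a terminal DSQ */
		if (bpf_map_push_elem(&global_queue, &pid, 0)) {
			__sync_fetch_and_add(&bpf_queue_full_cnt, 1);
			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL,
					   enq_flags);
		}

and in dequeue.c, after the scheduler is unloaded (assuming the counter ends
up in the skeleton's .bss as usual):

	printf("bpf_queue_full_cnt=%llu\n",
	       (unsigned long long)skel->bss->bpf_queue_full_cnt);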

> 
> >> 
> >> > +	__type(value, s32);
> >> > +} global_queue SEC(".maps");
> >> > +
> >> > +char _license[] SEC("license") = "GPL";
> >> > +
> >> > +UEI_DEFINE(uei);
> >> > +
> >> > +/*
> >> > + * Counters to track the lifecycle of tasks:
> >> > + * - enqueue_cnt: Number of times ops.enqueue() was called
> >> > + * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
> >> > + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
> >> > + * - change_dequeue_cnt: Number of property change dequeues
> >> > + */
> >> > +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt;
> >> > +
> >> > +/*
> >> > + * Test scenarios (0-2: ops.select_cpu(), 3-6: ops.enqueue()):
> >> > + * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
> >> > + *    scheduler, no dequeue callbacks)
> >> > + * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
> >> > + *    scheduler, no dequeue callbacks)
> >> > + * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
> >> > + *    dequeue callbacks expected)
> >> > + * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
> >> > + *    scheduler, no dequeue callbacks)
> >> > + * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
> >> > + *    scheduler, no dequeue callbacks)
> >> > + * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
> >> > + *    dequeue callbacks expected)
> >> > + * 6) BPF internal queue: store task PIDs in ops.enqueue(), consume in
> >> > + *    ops.dispatch() and dispatch to local DSQ (validates dequeue for tasks
> >> > + *    in BPF custody but not on a user DSQ)
> >> > + */
> >> > +u32 test_scenario;
> >> > +
> >> > +/*
> >> > + * Per-task state to track lifecycle and validate workflow semantics.
> >> > + * State transitions:
> >> > + *   NONE -> ENQUEUED (on enqueue)
> >> > + *   ENQUEUED -> DISPATCHED (on dispatch dequeue)
> >> > + *   DISPATCHED -> NONE (on property change dequeue or re-enqueue)
> >> > + *   ENQUEUED -> NONE (on property change dequeue before dispatch)
> >> > + */
> >> > +enum task_state {
> >> > +	TASK_NONE = 0,
> >> > +	TASK_ENQUEUED,
> >> > +	TASK_DISPATCHED,
> >> > +};
> >> > +
> >> > +struct task_ctx {
> >> > +	enum task_state state; /* Current state in the workflow */
> >> > +	u64 enqueue_seq;       /* Sequence number for debugging */
> >> > +};
> >> > +
> >> > +struct {
> >> > +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> >> > +	__uint(map_flags, BPF_F_NO_PREALLOC);
> >> > +	__type(key, int);
> >> > +	__type(value, struct task_ctx);
> >> > +} task_ctx_stor SEC(".maps");
> >> > +
> >> > +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
> >> > +{
> >> > +	return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
> >> > +}
> >> > +
> >> > +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
> >> > +		   s32 prev_cpu, u64 wake_flags)
> >> > +{
> >> > +	struct task_ctx *tctx;
> >> > +
> >> > +	tctx = try_lookup_task_ctx(p);
> >> > +	if (!tctx)
> >> > +		return prev_cpu;
> >> > +
> >> > +	switch (test_scenario) {
> >> > +	case 0:
> >> > +		/*
> >> > +		 * Scenario 0: Direct dispatch to local DSQ from select_cpu.
> >> > +		 *
> >> > +		 * Task bypasses BPF scheduler entirely: no enqueue
> >> > +		 * tracking, no dequeue callbacks. Behavior should be
> >> > +		 * identical to scenario 3.
> >> > +		 */
> >> > +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> >> > +		return prev_cpu;
> >> > +
> >> > +	case 1:
> >> > +		/*
> >> > +		 * Scenario 1: Direct dispatch to global DSQ from select_cpu.
> >> > +		 *
> >> > +		 * Like scenario 0, task bypasses BPF scheduler entirely.
> >> > +		 * Behavior should be identical to scenario 4.
> >> > +		 */
> >> > +		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
> >> > +		return prev_cpu;
> >> > +
> >> > +	case 2:
> >> > +		/*
> >> > +		 * Scenario 2: Dispatch to shared user DSQ from select_cpu.
> >> > +		 *
> >> > +		 * Task enters BPF scheduler management: track
> >> > +		 * enqueue/dequeue lifecycle and validate state transitions.
> >> > +		 * Behavior should be identical to scenario 5.
> >> > +		 */
> >> > +		__sync_fetch_and_add(&enqueue_cnt, 1);
> >> > +
> >> > +		/*
> >> > +		 * Validate state transition: enqueue is only valid from
> >> > +		 * NONE or DISPATCHED states. Getting enqueue while in
> >> > +		 * ENQUEUED state indicates a missing dequeue.
> >> > +		 */
> >> > +		if (tctx->state == TASK_ENQUEUED)
> >> > +			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
> >> > +				      p->pid, p->comm, tctx->enqueue_seq);
> >> > +
> >> > +		/* Transition to ENQUEUED state */
> >> > +		tctx->state = TASK_ENQUEUED;
> >> > +		tctx->enqueue_seq++;
> >> > +
> >> > +		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
> >> > +		return prev_cpu;
> >> > +
> >> > +	default:
> >> > +		/*
> >> > +		 * Force all tasks through ops.enqueue().
> >> > +		 */
> >> > +		return prev_cpu;
> >> > +	}
> >> > +}
> >> > +
> >> > +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
> >> > +{
> >> > +	struct task_ctx *tctx;
> >> > +
> >> > +	tctx = try_lookup_task_ctx(p);
> >> > +	if (!tctx)
> >> > +		return;
> >> > +
> >> > +	switch (test_scenario) {
> >> > +	case 3:
> >> > +		/*
> >> > +		 * Scenario 3: Direct dispatch to the local DSQ.
> >> > +		 *
> >> > +		 * Task bypasses BPF scheduler entirely: no enqueue
> >> > +		 * tracking, no dequeue callbacks. Don't increment counters
> >> > +		 * or validate state since the task never enters BPF
> >> > +		 * scheduler management.
> >> > +		 */
> >> > +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
> >> > +		break;
> >> > +
> >> > +	case 4:
> >> > +		/*
> >> > +		 * Scenario 4: Direct dispatch to the global DSQ.
> >> > +		 *
> >> > +		 * Like scenario 3, task bypasses BPF scheduler entirely.
> >> > +		 * SCX_DSQ_GLOBAL is a terminal DSQ, tasks dispatched to it
> >> > +		 * leave BPF custody immediately, so no dequeue callbacks
> >> > +		 * should be triggered.
> >> > +		 */
> >> > +		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
> >> > +		break;
> >> > +
> >> > +	case 5:
> >> > +		/*
> >> > +		 * Scenario 5: Dispatch to shared user DSQ.
> >> > +		 *
> >> > +		 * Task enters BPF scheduler management: track
> >> > +		 * enqueue/dequeue lifecycle and validate state
> >> > +		 * transitions.
> >> > +		 */
> >> > +		__sync_fetch_and_add(&enqueue_cnt, 1);
> >> > +
> >> > +		/*
> >> > +		 * Validate state transition: enqueue is only valid from
> >> > +		 * NONE or DISPATCHED states. Getting enqueue while in
> >> > +		 * ENQUEUED state indicates a missing dequeue (or stale state
> >> > +		 * from a previous scenario when the scheduler was unregistered
> >> > +		 * with tasks still on a DSQ). Reset and proceed to avoid false
> >> > +		 * positives across scenario switches.
> >> > +		 */
> >> > +		if (tctx->state == TASK_ENQUEUED)
> >> > +			tctx->state = TASK_NONE;
> >> > +
> >> > +		/* Transition to ENQUEUED state */
> >> > +		tctx->state = TASK_ENQUEUED;
> >> > +		tctx->enqueue_seq++;
> >> > +
> >> > +		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
> >> > +		break;
> >> > +
> >> > +	case 6:
> >> > +		/*
> >> > +		 * Scenario 6: Store task in BPF internal queue. Task enters
> >> > +		 * BPF custody (kernel sets SCX_TASK_NEED_DEQ). When
> >> > +		 * ops.dispatch() later pops and inserts to local DSQ,
> >> > +		 * ops.dequeue() must be called.
> >> > +		 *
> >> > +		 * If the queue is full, fallback to local DSQ. The task still
> >> > +		 * goes through QUEUED in the kernel and gets ops.dequeue()
> >> > +		 * when moved to the terminal DSQ, so we track it the same.
> >> > +		 *
> >> > +		 * If state is already ENQUEUED (e.g. task was on a DSQ when
> >> > +		 * the scheduler was unregistered in a previous scenario),
> >> > +		 * reset to NONE and proceed to avoid false positives.
> >> > +		 */
> >> > +		{
> >> > +			s32 pid = p->pid;
> >> > +
> >> > +			if (tctx->state == TASK_ENQUEUED)
> >> > +				tctx->state = TASK_NONE;
> >> > +
> >> > +			tctx->state = TASK_ENQUEUED;
> >> > +			tctx->enqueue_seq++;
> >> > +
> >> > +			/* Queue full: fallback to the global DSQ */
> >> Nit: Can we remove this fallback? This silently changes the behavior of
> >> the test, and even though it makes sense to avoid overflowing the queue,
> >> it causes the test to succeed even if for some reason
> >> bpf_map_push_elem() fails. Why not just bump the queue size to a
> >> reasonably large number instead?
> >
> > Hm... but if for any reason we overflow the queue we'd get a false positive
> > error: the task is ignored, we trigger a stall, and it looks like something
> > is wrong in ops.dequeue(). WDYT?
> >
> 
> I agree, but if we bump the queue size to a large number the probability
> of that is practically nonexistent: I think these tests make sense to run
> in CI-like environments where there are few processes anyway, so if the
> queue is large enough there will not be enough tasks to overflow it. This
> is an assumption we make for dsp_local_on, too. Maybe we can keep the
> fallback but warn when it's used (see above)?

I think keeping the fallback and reporting a counter of "queue full" events
is the best option.

Thanks,
-Andrea
