Message-Id: <DG9BLVVE2TJQ.381PKRLX41AM2@etsalapatis.com>
Date: Sun, 08 Feb 2026 00:11:11 -0500
From: "Emil Tsalapatis" <emil@...alapatis.com>
To: "Andrea Righi" <arighi@...dia.com>
Cc: "Tejun Heo" <tj@...nel.org>, "David Vernet" <void@...ifault.com>,
"Changwoo Min" <changwoo@...lia.com>, "Kuba Piecuch" <jpiecuch@...gle.com>,
"Christian Loehle" <christian.loehle@....com>, "Daniel Hodges"
<hodgesd@...a.com>, <sched-ext@...ts.linux.dev>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate
ops.dequeue() semantics
On Sat Feb 7, 2026 at 4:16 AM EST, Andrea Righi wrote:
> Hi Emil,
>
Hi Andrea,
> On Fri, Feb 06, 2026 at 03:10:55PM -0500, Emil Tsalapatis wrote:
>> On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
>>
>> Hi Andrea,
>>
>> > Add a new kselftest to validate that the new ops.dequeue() semantics
>> > work correctly for all task lifecycle scenarios, including the
>> > distinction between terminal DSQs (where BPF scheduler is done with the
>> > task), user DSQs (where BPF scheduler manages the task lifecycle) and
>> > BPF data structures, regardless of which event performs the dispatch.
>> >
>> > The test validates the following scenarios:
>> >
>> > - From ops.select_cpu():
>> > - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
>> > the BPF scheduler entirely; they never enter BPF custody, so
>> > ops.dequeue() is not called,
>> > - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
>> > bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
>> > not called,
>> > - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
>> > enqueue/dequeue lifecycle tracking and state machine validation
>> > (expects 1:1 enqueue/dequeue pairing).
>>
>> Could you add a note here about why there's no equivalent to scenario 6?
>> The differentiating factor between that and scenario 2 (nonterminal queue) is
>> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
>> And this makes sense since for non-DSQ queues the BPF scheduler can do its
>> own tracking of enqueue/dequeue (plus it does not make too much sense to
>> do BPF-internal enqueueing in select_cpu).
>>
>> What do you think? If the above makes sense, maybe we should spell it out
>> in the documentation too. Maybe also add that it makes no sense to enqueue
>> into an internal BPF structure from select_cpu - the task is not yet
>> enqueued, and would have to go through enqueue anyway.
>
> Oh, I just didn't think about it, we can definitely add to ops.select_cpu()
> a scenario equivalent to scenario 6 (push task to the BPF queue).
>
> From a practical standpoint the benefits are questionable, but in the scope
> of the kselftest I think it makes sense to better validate the entire state
> machine in all cases. I'll add this scenario as well.
>
That makes sense! Let's add it for completeness. Even if it doesn't make
sense right now, that may change in the future. For example, if we end
up finding a good reason to add the task to an internal structure from
.select_cpu(), we may allow the task to be explicitly marked as being in
the BPF scheduler's custody from a kfunc. Right now we can't do that
from select_cpu() unless we direct dispatch, IIUC.
>>
>> >
>> > - From ops.enqueue():
>> > - scenario 3 (local DSQ): same behavior as scenario 0,
>> > - scenario 4 (global DSQ): same behavior as scenario 1,
>> > - scenario 5 (user DSQ): same behavior as scenario 2,
>> > - scenario 6 (BPF internal queue): tasks are stored in a BPF queue
>> > in ops.enqueue() and consumed in ops.dispatch(); they remain in
>> > BPF custody until dispatch, with full lifecycle tracking and 1:1
>> > enqueue/dequeue validation.
>> >
>> > This verifies that:
>> > - terminal DSQ dispatch (local, global) don't trigger ops.dequeue(),
>> > - user DSQ / internal BPF data structure dispatch has exact 1:1
>> > ops.enqueue()/dequeue() pairing,
>> > - dispatch dequeues have no flags (normal workflow),
>> > - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set,
>> > - no duplicate enqueues or invalid state transitions are happening,
>> > - ops.enqueue() and ops.select_cpu() dispatch paths behave identically.
>> >
>> > Cc: Tejun Heo <tj@...nel.org>
>> > Cc: Emil Tsalapatis <emil@...alapatis.com>
>> > Cc: Kuba Piecuch <jpiecuch@...gle.com>
>> > Signed-off-by: Andrea Righi <arighi@...dia.com>
>> > ---
>> > tools/testing/selftests/sched_ext/Makefile | 1 +
>> > .../testing/selftests/sched_ext/dequeue.bpf.c | 403 ++++++++++++++++++
>> > tools/testing/selftests/sched_ext/dequeue.c | 258 +++++++++++
>> > 3 files changed, 662 insertions(+)
>> > create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
>> > create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
>> >
>> > diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
>> > index 5fe45f9c5f8fd..764e91edabf93 100644
>> > --- a/tools/testing/selftests/sched_ext/Makefile
>> > +++ b/tools/testing/selftests/sched_ext/Makefile
>> > @@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
>> >
>> > auto-test-targets := \
>> > create_dsq \
>> > + dequeue \
>> > enq_last_no_enq_fails \
>> > ddsp_bogus_dsq_fail \
>> > ddsp_vtimelocal_fail \
>> > diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c
>> > new file mode 100644
>> > index 0000000000000..4ba657ba1bff5
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c
>> > @@ -0,0 +1,403 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/*
>> > + * A scheduler that validates ops.dequeue() is called correctly:
>> > + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
>> > + * scheduler entirely: no ops.dequeue() should be called
>> > + * - Tasks dispatched to user DSQs enter BPF custody: ops.dequeue() must be
>> > + * called when they leave custody
>> > + * - Every ops.enqueue() for non-terminal DSQs is followed by exactly one
>> > + * ops.dequeue() (validate 1:1 pairing and state machine)
>> > + *
>> > + * Copyright (c) 2026 NVIDIA Corporation.
>> > + */
>> > +
>> > +#include <scx/common.bpf.h>
>> > +
>> > +#define SHARED_DSQ 0
>> > +
>> > +/*
>> > + * Scenario 6: BPF internal queue. Tasks are stored here from ops.enqueue()
>> > + * and consumed from ops.dispatch(), validating that tasks not on a user DSQ
>> > + * (only on BPF internal structures) still get ops.dequeue() when they leave.
>> > + */
>> > +struct {
>> > + __uint(type, BPF_MAP_TYPE_QUEUE);
>> > + __uint(max_entries, 4096);
>>
>> Nit: Can we make this larger? I don't think there are any downsides. I know
>> there's a mitigation for when the queue gets full, please see the nit below.
>
> Sure, like 32768?
>
> Or we can keep it like this so we can potentially test also the fallback
> path sometimes (mixed BPF queue dispatches + built-in DSQ dispatches).
>
32K makes sense. If we keep the fallback, maybe we can also add a
WARN_ON_ONCE() equivalent that flags when it gets triggered, so we can
make sure we don't hit it every single time (e.g. because the BPF queue
is misbehaving)? Something like the sketch below.
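
Just a sketch of what I mean -- the fallback_cnt/fallback_warned names are
made up, and I'm assuming the fallback stays in the scenario 6 branch of
ops.enqueue() and keeps inserting into the global DSQ:

	/* Sketch only: placeholder names, counter exposed for userspace. */
	u64 fallback_cnt;
	static bool fallback_warned;

	/* Queue full: fall back, but make the fallback visible. */
	if (bpf_map_push_elem(&global_queue, &pid, 0)) {
		__sync_fetch_and_add(&fallback_cnt, 1);
		if (!fallback_warned) {
			fallback_warned = true;
			bpf_printk("dequeue: BPF queue full, using fallback DSQ");
		}
		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL,
				   enq_flags);
	}
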
>>
>> > + __type(value, s32);
>> > +} global_queue SEC(".maps");
>> > +
>> > +char _license[] SEC("license") = "GPL";
>> > +
>> > +UEI_DEFINE(uei);
>> > +
>> > +/*
>> > + * Counters to track the lifecycle of tasks:
>> > + * - enqueue_cnt: Number of times ops.enqueue() was called
>> > + * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
>> > + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
>> > + * - change_dequeue_cnt: Number of property change dequeues
>> > + */
>> > +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt;
>> > +
>> > +/*
>> > + * Test scenarios (0-2: ops.select_cpu(), 3-6: ops.enqueue()):
>> > + * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
>> > + * dequeue callbacks expected)
>> > + * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
>> > + * scheduler, no dequeue callbacks)
>> > + * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
>> > + * dequeue callbacks expected)
>> > + * 6) BPF internal queue: store task PIDs in ops.enqueue(), consume in
>> > + * ops.dispatch() and dispatch to local DSQ (validates dequeue for tasks
>> > + * in BPF custody but not on a user DSQ)
>> > + */
>> > +u32 test_scenario;
>> > +
>> > +/*
>> > + * Per-task state to track lifecycle and validate workflow semantics.
>> > + * State transitions:
>> > + * NONE -> ENQUEUED (on enqueue)
>> > + * ENQUEUED -> DISPATCHED (on dispatch dequeue)
>> > + * DISPATCHED -> NONE (on property change dequeue or re-enqueue)
>> > + * ENQUEUED -> NONE (on property change dequeue before dispatch)
>> > + */
>> > +enum task_state {
>> > + TASK_NONE = 0,
>> > + TASK_ENQUEUED,
>> > + TASK_DISPATCHED,
>> > +};
>> > +
>> > +struct task_ctx {
>> > + enum task_state state; /* Current state in the workflow */
>> > + u64 enqueue_seq; /* Sequence number for debugging */
>> > +};
>> > +
>> > +struct {
>> > + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
>> > + __uint(map_flags, BPF_F_NO_PREALLOC);
>> > + __type(key, int);
>> > + __type(value, struct task_ctx);
>> > +} task_ctx_stor SEC(".maps");
>> > +
>> > +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
>> > +{
>> > + return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
>> > +}
>> > +
>> > +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
>> > + s32 prev_cpu, u64 wake_flags)
>> > +{
>> > + struct task_ctx *tctx;
>> > +
>> > + tctx = try_lookup_task_ctx(p);
>> > + if (!tctx)
>> > + return prev_cpu;
>> > +
>> > + switch (test_scenario) {
>> > + case 0:
>> > + /*
>> > + * Scenario 0: Direct dispatch to local DSQ from select_cpu.
>> > + *
>> > + * Task bypasses BPF scheduler entirely: no enqueue
>> > + * tracking, no dequeue callbacks. Behavior should be
>> > + * identical to scenario 3.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
>> > + return prev_cpu;
>> > +
>> > + case 1:
>> > + /*
>> > + * Scenario 1: Direct dispatch to global DSQ from select_cpu.
>> > + *
>> > + * Like scenario 0, task bypasses BPF scheduler entirely.
>> > + * Behavior should be identical to scenario 4.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
>> > + return prev_cpu;
>> > +
>> > + case 2:
>> > + /*
>> > + * Scenario 2: Dispatch to shared user DSQ from select_cpu.
>> > + *
>> > + * Task enters BPF scheduler management: track
>> > + * enqueue/dequeue lifecycle and validate state transitions.
>> > + * Behavior should be identical to scenario 5.
>> > + */
>> > + __sync_fetch_and_add(&enqueue_cnt, 1);
>> > +
>> > + /*
>> > + * Validate state transition: enqueue is only valid from
>> > + * NONE or DISPATCHED states. Getting enqueue while in
>> > + * ENQUEUED state indicates a missing dequeue.
>> > + */
>> > + if (tctx->state == TASK_ENQUEUED)
>> > + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
>> > + p->pid, p->comm, tctx->enqueue_seq);
>> > +
>> > + /* Transition to ENQUEUED state */
>> > + tctx->state = TASK_ENQUEUED;
>> > + tctx->enqueue_seq++;
>> > +
>> > + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
>> > + return prev_cpu;
>> > +
>> > + default:
>> > + /*
>> > + * Force all tasks through ops.enqueue().
>> > + */
>> > + return prev_cpu;
>> > + }
>> > +}
>> > +
>> > +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
>> > +{
>> > + struct task_ctx *tctx;
>> > +
>> > + tctx = try_lookup_task_ctx(p);
>> > + if (!tctx)
>> > + return;
>> > +
>> > + switch (test_scenario) {
>> > + case 3:
>> > + /*
>> > + * Scenario 3: Direct dispatch to the local DSQ.
>> > + *
>> > + * Task bypasses BPF scheduler entirely: no enqueue
>> > + * tracking, no dequeue callbacks. Don't increment counters
>> > + * or validate state since the task never enters BPF
>> > + * scheduler management.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
>> > + break;
>> > +
>> > + case 4:
>> > + /*
>> > + * Scenario 4: Direct dispatch to the global DSQ.
>> > + *
>> > + * Like scenario 3, task bypasses BPF scheduler entirely.
>> > + * SCX_DSQ_GLOBAL is a terminal DSQ, tasks dispatched to it
>> > + * leave BPF custody immediately, so no dequeue callbacks
>> > + * should be triggered.
>> > + */
>> > + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
>> > + break;
>> > +
>> > + case 5:
>> > + /*
>> > + * Scenario 5: Dispatch to shared user DSQ.
>> > + *
>> > + * Task enters BPF scheduler management: track
>> > + * enqueue/dequeue lifecycle and validate state
>> > + * transitions.
>> > + */
>> > + __sync_fetch_and_add(&enqueue_cnt, 1);
>> > +
>> > + /*
>> > + * Validate state transition: enqueue is only valid from
>> > + * NONE or DISPATCHED states. Getting enqueue while in
>> > + * ENQUEUED state indicates a missing dequeue (or stale state
>> > + * from a previous scenario when the scheduler was unregistered
>> > + * with tasks still on a DSQ). Reset and proceed to avoid false
>> > + * positives across scenario switches.
>> > + */
>> > + if (tctx->state == TASK_ENQUEUED)
>> > + tctx->state = TASK_NONE;
>> > +
>> > + /* Transition to ENQUEUED state */
>> > + tctx->state = TASK_ENQUEUED;
>> > + tctx->enqueue_seq++;
>> > +
>> > + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
>> > + break;
>> > +
>> > + case 6:
>> > + /*
>> > + * Scenario 6: Store task in BPF internal queue. Task enters
>> > + * BPF custody (kernel sets SCX_TASK_NEED_DEQ). When
>> > + * ops.dispatch() later pops and inserts to local DSQ,
>> > + * ops.dequeue() must be called.
>> > + *
>> > + * If the queue is full, fallback to local DSQ. The task still
>> > + * goes through QUEUED in the kernel and gets ops.dequeue()
>> > + * when moved to the terminal DSQ, so we track it the same.
>> > + *
>> > + * If state is already ENQUEUED (e.g. task was on a DSQ when
>> > + * the scheduler was unregistered in a previous scenario),
>> > + * reset to NONE and proceed to avoid false positives.
>> > + */
>> > + {
>> > + s32 pid = p->pid;
>> > +
>> > + if (tctx->state == TASK_ENQUEUED)
>> > + tctx->state = TASK_NONE;
>> > +
>> > + tctx->state = TASK_ENQUEUED;
>> > + tctx->enqueue_seq++;
>> > +
>> > + /* Queue full: fallback to the global DSQ */
>> Nit: Can we remove this fallback? It silently changes the behavior of
>> the test, and even though it makes sense to avoid overflowing the queue,
>> it causes the test to succeed even if for some reason the
>> bpf_map_push_elem() call fails. Why not just bump the queue size to a
>> reasonably large number instead?
>
> Hm... but if for any reason we overflow the queue we'd get a false positive
> error: task is ignored, we trigger a stall and it looks like something is
> wrong in ops.dequeue(). WDYT?
>
I agree, but if we bump the queue size to a large number the probability
of that is practically nonexistent: I think these tests make sense to run
in CI-like environments where there are few processes anyway, so if the
queue is large enough there will never be enough tasks to overflow it.
This is an assumption we make for dsp_local_on, too. Maybe we can keep
the fallback but warn when it's used (see above)?
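
On the userspace side the check could be as simple as the following --
again just a sketch, assuming the fallback_cnt counter from the sketch
above ends up in the skeleton's bss and this runs in the test's run()
callback after the workload:

	/* Warn if the fallback was ever hit; fail only if it always fires. */
	if (skel->bss->fallback_cnt) {
		fprintf(stderr, "warning: BPF queue fallback hit %llu times\n",
			(unsigned long long)skel->bss->fallback_cnt);
		if (skel->bss->fallback_cnt >= skel->bss->enqueue_cnt)
			return SCX_TEST_FAIL;
	}
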
> Thanks,
> -Andrea