Message-Id: <DG860JW64VVD.31BS2QTEB8XZQ@etsalapatis.com>
Date: Fri, 06 Feb 2026 15:35:34 -0500
From: "Emil Tsalapatis" <emil@...alapatis.com>
To: "Andrea Righi" <arighi@...dia.com>, "Tejun Heo" <tj@...nel.org>, "David
 Vernet" <void@...ifault.com>, "Changwoo Min" <changwoo@...lia.com>
Cc: "Kuba Piecuch" <jpiecuch@...gle.com>, "Christian Loehle"
 <christian.loehle@....com>, "Daniel Hodges" <hodgesd@...a.com>,
 <sched-ext@...ts.linux.dev>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in the BPF
> scheduler's custody when the scheduler is responsible for managing its
> lifecycle. This includes tasks dispatched to user-created DSQs or stored
> in the BPF scheduler's internal data structures. Custody ends when the
> task is dispatched to a terminal DSQ (such as the local DSQ or
> %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a
> property change.
>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are never in its custody. Terminal DSQs include:
>  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
>    where tasks go directly to execution.
>  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
>    BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks directly dispatched
> to terminal DSQs.
>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> New ops.dequeue() semantics:
>  - ops.dequeue() is invoked exactly once when the task leaves the BPF
>    scheduler's custody, in one of the following cases:
>    a) regular dispatch: a task dispatched to a user DSQ or stored in
>       internal BPF data structures is moved to a terminal DSQ
>       (ops.dequeue() called without any special flags set),
>    b) core scheduling dispatch: core-sched picks the task before dispatch
>       (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
>    c) property change: task properties are modified before dispatch
>       (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).
>
> This allows BPF schedulers to:
>  - reliably track task ownership and lifecycle,
>  - maintain accurate accounting of managed tasks,
>  - update internal state when tasks change properties.
>
> Cc: Tejun Heo <tj@...nel.org>
> Cc: Emil Tsalapatis <emil@...alapatis.com>
> Cc: Kuba Piecuch <jpiecuch@...gle.com>
> Signed-off-by: Andrea Righi <arighi@...dia.com>
> ---

Hi Andrea,

>  Documentation/scheduler/sched-ext.rst         |  58 +++++++
>  include/linux/sched/ext.h                     |   1 +
>  kernel/sched/ext.c                            | 157 ++++++++++++++++--
>  kernel/sched/ext_internal.h                   |   7 +
>  .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
>  .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
>  tools/sched_ext/include/scx/enums.autogen.h   |   1 +
>  7 files changed, 213 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..fe8c59b0c1477 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,62 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   A task is in the "BPF scheduler's custody" when the BPF scheduler is
> +   responsible for managing its lifecycle. That includes tasks dispatched
> +   to user-created DSQs or stored in the BPF scheduler's internal data
> +   structures. Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called,
> +   the task may or may not enter custody depending on what the scheduler
> +   does:
> +
> +   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
> +     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> +     is done with the task - it either goes straight to a CPU's local run
> +     queue or to the global DSQ as a fallback. The task never enters (or
> +     exits) BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> +     BPF scheduler's custody. When the task later leaves BPF custody
> +     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> +     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> +
> +   * **Queued on BPF side** (e.g., internal queues, no DSQ): The task is in
> +     BPF custody. ``ops.dequeue()`` will be called when it leaves (e.g.
> +     when ``ops.dispatch()`` moves it to a terminal DSQ, or on property
> +     change / sleep).
> +
> +   **NOTE**: this concept also applies to the ``ops.select_cpu()``
> +   direct dispatch optimization. Even though it skips the ``ops.enqueue()``
> +   invocation, if the task is dispatched to a user-created DSQ or internal
> +   BPF structure, it enters BPF custody and will get ``ops.dequeue()`` when
> +   it leaves. If dispatched to a terminal DSQ, the BPF scheduler is done
> +   with it immediately. This provides the performance benefit of avoiding
> +   the ``ops.enqueue()`` roundtrip while maintaining correct state
> +   tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
> +      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
> +      execution), ``ops.dequeue()`` is triggered without any special flags.
> +
> +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> +      core scheduling picks a task for execution while it's still in BPF
> +      custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> +   3. **Scheduling property change**: when a task property changes (via
> +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> +      priority changes, CPU migrations, etc.) while the task is still in
> +      BPF custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> +   **Important**: Once a task has left BPF custody (e.g. after being
> +   dispatched to a terminal DSQ), property changes will not trigger
> +   ``ops.dequeue()``, since the task is no longer being managed by the BPF
> +   scheduler.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +375,8 @@ by a sched_ext scheduler:
>                  /* Any usable CPU becomes available */
>  
>                  ops.dispatch(); /* Task is moved to a local DSQ */
> +
> +                ops.dequeue(); /* Exiting BPF scheduler */
>              }
>              ops.running();      /* Task starts running on its assigned CPU */
>              while (task->scx.slice > 0 && task is runnable)
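
The new documentation reads well to me. One side note: with the "exactly
once per custody exit" guarantee, BPF-side accounting becomes trivial to
get right. A minimal sketch of the pattern I have in mind (toy scheduler,
purely illustrative and not part of this patch; the toy_* names, TOY_DSQ
and nr_in_custody are mine, and the kfunc/helper names follow the current
tools/sched_ext headers, so they may differ on older kernels):

#include <scx/common.bpf.h>

#define TOY_DSQ		0

u64 nr_in_custody;	/* tasks currently in the scheduler's custody */

void BPF_STRUCT_OPS(toy_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Dispatching to a user DSQ puts @p in our custody. */
	scx_bpf_dsq_insert(p, TOY_DSQ, SCX_SLICE_DFL, enq_flags);
	__sync_fetch_and_add(&nr_in_custody, 1);
}

void BPF_STRUCT_OPS(toy_dequeue, struct task_struct *p, u64 deq_flags)
{
	/*
	 * Called exactly once when @p leaves custody, whether via a
	 * regular dispatch, a core-sched pick (SCX_DEQ_CORE_SCHED_EXEC)
	 * or a property change (SCX_DEQ_SCHED_CHANGE), so the counter
	 * stays balanced by construction.
	 */
	__sync_fetch_and_sub(&nr_in_custody, 1);
}

void BPF_STRUCT_OPS(toy_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Moving a task to the local DSQ triggers toy_dequeue() above. */
	scx_bpf_dsq_move_to_local(TOY_DSQ);
}

s32 BPF_STRUCT_OPS_SLEEPABLE(toy_init)
{
	return scx_bpf_create_dsq(TOY_DSQ, -1);
}

SCX_OPS_DEFINE(toy_ops,
	       .enqueue		= (void *)toy_enqueue,
	       .dequeue		= (void *)toy_dequeue,
	       .dispatch	= (void *)toy_dispatch,
	       .init		= (void *)toy_init,
	       .name		= "toy");

char _license[] SEC("license") = "GPL";
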
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..c48f818eee9b8 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_NEED_DEQ	= 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */

Can we make this "SCX_TASK_IN_BPF"? Now that we've defined what it means to be
in BPF custody vs. in the core scx scheduler (terminal DSQs), this is a more
general property that may be useful to check in the future. For example, we can
now assert that a task's BPF-side state is consistent with its actual kernel
state when BPF-based data structures are used to manage tasks.

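To make the kind of assertion I mean concrete, here is a rough BPF-side
sketch (hypothetical task_ctx/map/callback names, not from this patch):
with a per-task "queued internally" bit mirroring the scheduler's own
bookkeeping, ops.dequeue() can verify that the kernel's view and the BPF
side agree. The enqueue/select_cpu path would set the bit when stashing
the task; that part is omitted here.

#include <scx/common.bpf.h>

struct task_ctx {
	bool queued_internally;	/* scheduler believes @p is in its custody */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctxs SEC(".maps");

void BPF_STRUCT_OPS(foo_dequeue, struct task_struct *p, u64 deq_flags)
{
	struct task_ctx *tctx = bpf_task_storage_get(&task_ctxs, p, 0, 0);

	/* With the new semantics, ops.dequeue() implies prior custody. */
	if (!tctx || !tctx->queued_internally) {
		scx_bpf_error("dequeue for task we never tracked as queued");
		return;
	}
	tctx->queued_internally = false;
}
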
>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bb8fa927e9e9..d17fd9141adf4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
>  #endif
>  }
>  
> +/**
> + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> + * @dsq_id: DSQ ID to check
> + *
> + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> + * scheduler is considered "done" with the task.
> + *
> + * Builtin DSQs include:
> + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> + *    where tasks go directly to execution,
> + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> + *  - Bypass DSQ: used during bypass mode.
> + *
> + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> + * trigger ops.dequeue() when they are later consumed.
> + */
> +static inline bool is_terminal_dsq(u64 dsq_id)
> +{
> +	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
> +}
> +
>  /**
>   * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
>   * @rq: rq to read clock from, must be locked
> @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
>  		resched_curr(rq);
>  }
>  
> -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> +			     struct scx_dispatch_q *dsq,
>  			     struct task_struct *p, u64 enq_flags)
>  {
>  	bool is_local = dsq->id == SCX_DSQ_LOCAL;
> @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>  	dsq_mod_nr(dsq, 1);
>  	p->scx.dsq = dsq;
>  
> +	/*
> +	 * Handle ops.dequeue() and custody tracking.
> +	 *
> +	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
> +	 * scheduler is done with the task. If it was in BPF custody, call
> +	 * ops.dequeue() and clear the flag.
> +	 *
> +	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> +	 * ops.dequeue() will be called when it leaves.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (is_terminal_dsq(dsq->id)) {
> +			if (p->scx.flags & SCX_TASK_NEED_DEQ)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> +						 rq, p, 0);
> +			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +		} else {
> +			p->scx.flags |= SCX_TASK_NEED_DEQ;
> +		}
> +	}
> +
>  	/*
>  	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
>  	 * direct dispatch path, but we clear them here because the direct
> @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>  		return;
>  	}
>  
> -	dispatch_enqueue(sch, dsq, p,
> +	dispatch_enqueue(sch, rq, dsq, p,
>  			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
>  }
>  
> @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 * dequeue may be waiting. The store_release matches their load_acquire.
>  	 */
>  	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> +
> +	/*
> +	 * Task is now in BPF scheduler's custody (queued on BPF internal
> +	 * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called
> +	 * when it leaves custody (e.g. dispatched to a terminal DSQ or on
> +	 * property change).
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))

Related to the rename: can we remove these guards and track the flag
regardless of whether ops.dequeue() is implemented?

There is no reason not to track whether a task is in BPF custody or in the
core, and that property is independent of whether the scheduler implements
ops.dequeue(). It also simplifies the code, since only the actual
ops.dequeue() call needs to be guarded.

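i.e. roughly this shape in dispatch_enqueue() (untested sketch, using the
IN_BPF name suggested above, just to illustrate what I mean):

	if (is_terminal_dsq(dsq->id)) {
		if ((p->scx.flags & SCX_TASK_IN_BPF) && SCX_HAS_OP(sch, dequeue))
			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
		p->scx.flags &= ~SCX_TASK_IN_BPF;
	} else {
		p->scx.flags |= SCX_TASK_IN_BPF;
	}

and here in do_enqueue_task() the flag would simply be set unconditionally:

	p->scx.flags |= SCX_TASK_IN_BPF;
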
> +		p->scx.flags |= SCX_TASK_NEED_DEQ;
>  	return;
>  
>  direct:
>  	direct_dispatch(sch, p, enq_flags);
>  	return;
>  local_norefill:
> -	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
> +	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
>  	return;
>  local:
>  	dsq = &rq->scx.local_dsq;
> @@ -1433,7 +1485,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 */
>  	touch_core_sched(rq, p);
>  	refill_task_slice_dfl(sch, p);
> -	dispatch_enqueue(sch, dsq, p, enq_flags);
> +	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
>  }
>  
>  static bool task_runnable(const struct task_struct *p)
> @@ -1511,6 +1563,22 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
>  		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
>  }
>  
> +/*
> + * Call ops.dequeue() for a task leaving BPF custody. Adds %SCX_DEQ_SCHED_CHANGE
> + * when the dequeue is due to a property change (not sleep or core-sched pick).
> + */
> +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> +			      struct task_struct *p, u64 deq_flags)
> +{
> +	u64 flags = deq_flags;
> +
> +	if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +		flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +	SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +	p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +}
> +
>  static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  {
>  	struct scx_sched *sch = scx_root;
> @@ -1524,6 +1592,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> +		 * is set).
> +		 *
> +		 * If the task has already been dispatched to a terminal
> +		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> +		 * scheduler's custody and the flag will be clear, so we
> +		 * skip ops.dequeue().
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    (p->scx.flags & SCX_TASK_NEED_DEQ))
> +			call_task_dequeue(sch, rq, p, deq_flags);
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1532,9 +1618,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify it is leaving BPF custody.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
> +			WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ));
> +			call_task_dequeue(sch, rq, p, deq_flags);
> +		}
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))
> @@ -1631,6 +1722,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  					 struct scx_dispatch_q *src_dsq,
>  					 struct rq *dst_rq)
>  {
> +	struct scx_sched *sch = scx_root;
>  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>  
>  	/* @dsq is locked and @p is on @dst_rq */
> @@ -1639,6 +1731,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  
>  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>  
> +	/*
> +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> +	 * Call ops.dequeue() if the task was in BPF custody.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> +		p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +	}
> +
>  	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
>  		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
>  	else
> @@ -1879,7 +1980,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  		dispatch_dequeue_locked(p, src_dsq);
>  		raw_spin_unlock(&src_dsq->lock);
>  
> -		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
> +		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
>  	}
>  
>  	return dst_rq;
> @@ -1969,14 +2070,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  	 * If dispatching to @rq that @p is already on, no lock dancing needed.
>  	 */
>  	if (rq == src_rq && rq == dst_rq) {
> -		dispatch_enqueue(sch, dst_dsq, p,
> +		dispatch_enqueue(sch, rq, dst_dsq, p,
>  				 enq_flags | SCX_ENQ_CLEAR_OPSS);
>  		return;
>  	}
>  
>  	if (src_rq != dst_rq &&
>  	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> -		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> +		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
>  				 enq_flags | SCX_ENQ_CLEAR_OPSS);
>  		return;
>  	}
> @@ -2014,9 +2115,21 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		 */
>  		if (src_rq == dst_rq) {
>  			p->scx.holding_cpu = -1;
> -			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> +			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
>  					 enq_flags);
>  		} else {
> +			/*
> +			 * Moving to a remote local DSQ. dispatch_enqueue() is
> +			 * not used (we go through deactivate/activate), so
> +			 * call ops.dequeue() here if the task was in BPF
> +			 * custody.
> +			 */
> +			if (SCX_HAS_OP(sch, dequeue) &&
> +			    (p->scx.flags & SCX_TASK_NEED_DEQ)) {
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> +						 src_rq, p, 0);
> +				p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +			}
>  			move_remote_task_to_local_dsq(p, enq_flags,
>  						      src_rq, dst_rq);
>  			/* task has been moved to dst_rq, which is now locked */
> @@ -2113,7 +2226,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  	if (dsq->id == SCX_DSQ_LOCAL)
>  		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
>  	else
> -		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> +		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
>  }
>  
>  static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
> @@ -2414,7 +2527,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  		 * DSQ.
>  		 */
>  		if (p->scx.slice && !scx_rq_bypassing(rq)) {
> -			dispatch_enqueue(sch, &rq->scx.local_dsq, p,
> +			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
>  					 SCX_ENQ_HEAD);
>  			goto switch_class;
>  		}
> @@ -2898,6 +3011,14 @@ static void scx_enable_task(struct task_struct *p)
>  
>  	lockdep_assert_rq_held(rq);
>  
> +	/*
> +	 * Verify the task is not in BPF scheduler's custody. If flag
> +	 * transitions are consistent, the flag should always be clear
> +	 * here.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
> +
>  	/*
>  	 * Set the weight before calling ops.enable() so that the scheduler
>  	 * doesn't see a stale value if they inspect the task struct.
> @@ -2929,6 +3050,14 @@ static void scx_disable_task(struct task_struct *p)
>  	if (SCX_HAS_OP(sch, disable))
>  		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
>  	scx_set_task_state(p, SCX_TASK_READY);
> +
> +	/*
> +	 * Verify the task is not in BPF scheduler's custody. If flag
> +	 * transitions are consistent, the flag should always be clear
> +	 * here.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
>  }
>  
>  static void scx_exit_task(struct task_struct *p)
> @@ -3919,7 +4048,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
>  		 * between bypass DSQs.
>  		 */
>  		dispatch_dequeue_locked(p, donor_dsq);
> -		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
> +		dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
>  
>  		/*
>  		 * $donee might have been idle and need to be woken up. No need
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 386c677e4c9a0..befa9a5d6e53f 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -982,6 +982,13 @@ enum scx_deq_flags {
>  	 * it hasn't been dispatched yet. Dequeue from the BPF side.
>  	 */
>  	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
> +
> +	/*
> +	 * The task is being dequeued due to a property change (e.g.,
> +	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
> +	 * etc.).
> +	 */
> +	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
>  };
>  
>  enum scx_pick_idle_cpu_flags {
> diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
> index c2c33df9292c2..dcc945304760f 100644
> --- a/tools/sched_ext/include/scx/enum_defs.autogen.h
> +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
> @@ -21,6 +21,7 @@
>  #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
>  #define HAVE_SCX_DEQ_SLEEP
>  #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
> +#define HAVE_SCX_DEQ_SCHED_CHANGE
>  #define HAVE_SCX_DSQ_FLAG_BUILTIN
>  #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
>  #define HAVE_SCX_DSQ_INVALID
> diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> index 2f8002bcc19ad..5da50f9376844 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
>  const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
>  #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
>  
> +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
> +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
> diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
> index fedec938584be..fc9a7a4d9dea5 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.h
> @@ -46,4 +46,5 @@
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
> +	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
>  } while (0)

