Message-ID: <aRGXd0QwgqBVu7Gq@gpd4>
Date: Mon, 10 Nov 2025 08:42:47 +0100
From: Andrea Righi <andrea.righi@...ux.dev>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
Dan Schatzberg <schatzberg.dan@...il.com>,
Emil Tsalapatis <etsal@...a.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node
global DSQs in bypass mode
Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:03AM -1000, Tejun Heo wrote:
> When bypass mode is activated, tasks are routed through a fallback dispatch
> queue instead of the BPF scheduler. Originally, bypass mode used a single
> global DSQ, but this didn't scale well on NUMA machines and could lead to
> livelocks. In b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"),
> this was changed to use per-node global DSQs, which resolved the
> cross-node-related livelocks.
>
> However, Dan Schatzberg found that per-node global DSQ can also livelock in a
> different scenario: On a NUMA node with many CPUs and many threads pinned to
> different small subsets of CPUs, each CPU often has to scan through many tasks
> it cannot run to find the one task it can run. With a high number of CPUs,
> this scanning overhead can easily cause livelocks.
>
> Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
> on the CPU that it's currently on. Because the default idle CPU selection
> policy and direct dispatch are both active during bypass, this works well in
> most cases including the above.

Is there any reason not to reuse rq->scx.local_dsq for this?

Thanks,
-Andrea

>
> However, this does have a failure mode in highly over-saturated systems where
> tasks are concentrated on a single CPU. If the BPF scheduler places most tasks
> on one CPU and then triggers bypass mode, bypass mode will keep those tasks on
> that one CPU, which can lead to failures such as RCU stalls as the queue may be
> too long for that CPU to drain in a reasonable time. This will be addressed
> with a load balancer in a future patch. The bypass DSQ is kept separate from
> the local DSQ to allow the load balancer to move tasks between bypass DSQs.
>
> Reported-by: Dan Schatzberg <schatzberg.dan@...il.com>
> Cc: Emil Tsalapatis <etsal@...a.com>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> ---
> include/linux/sched/ext.h | 1 +
> kernel/sched/ext.c | 16 +++++++++++++---
> kernel/sched/sched.h | 1 +
> 3 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 9f5b0f2be310..e1502faf6241 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -57,6 +57,7 @@ enum scx_dsq_id_flags {
>  	SCX_DSQ_INVALID = SCX_DSQ_FLAG_BUILTIN | 0,
>  	SCX_DSQ_GLOBAL = SCX_DSQ_FLAG_BUILTIN | 1,
>  	SCX_DSQ_LOCAL = SCX_DSQ_FLAG_BUILTIN | 2,
> +	SCX_DSQ_BYPASS = SCX_DSQ_FLAG_BUILTIN | 3,
>  	SCX_DSQ_LOCAL_ON = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
>  	SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
>  };
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a29bfadde89d..4b8b91494947 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1301,7 +1301,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>
>  	if (scx_rq_bypassing(rq)) {
>  		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> -		goto global;
> +		goto bypass;
>  	}
>
>  	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> @@ -1359,6 +1359,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  global:
>  	dsq = find_global_dsq(sch, p);
>  	goto enqueue;
> +bypass:
> +	dsq = &task_rq(p)->scx.bypass_dsq;
> +	goto enqueue;
>
>  enqueue:
>  	/*
> @@ -2157,8 +2160,14 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>  	if (consume_global_dsq(sch, rq))
>  		goto has_tasks;
>
> -	if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
> -	    scx_rq_bypassing(rq) || !scx_rq_online(rq))
> +	if (scx_rq_bypassing(rq)) {
> +		if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
> +			goto has_tasks;
> +		else
> +			goto no_tasks;
> +	}
> +
> +	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
>  		goto no_tasks;
>
>  	dspc->rq = rq;
> @@ -5370,6 +5379,7 @@ void __init init_sched_ext_class(void)
>  		int n = cpu_to_node(cpu);
>
>  		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
> +		init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
>  		INIT_LIST_HEAD(&rq->scx.runnable_list);
>  		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 27aae2a298f8..5991133a4849 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -808,6 +808,7 @@ struct scx_rq {
>  	struct balance_callback deferred_bal_cb;
>  	struct irq_work deferred_irq_work;
>  	struct irq_work kick_cpus_irq_work;
> +	struct scx_dispatch_q bypass_dsq;
>  };
>  #endif /* CONFIG_SCHED_CLASS_EXT */
>
> --
> 2.51.1
>
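
For readers following the thread, the scanning overhead described in the quoted commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_task`, `scan_cost()`, `shared_queue_total_cost()` and `per_cpu_queue_total_cost()` are hypothetical names that do not exist in the kernel, each task is assumed to be pinned to exactly one CPU, and every CPU scans the same queue snapshot. It only counts queue entries inspected and ignores locking, which in the real DSQ path would likely make the contention worse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_CPUS 64	/* hypothetical limit: one bit per CPU in a 64-bit mask */

/* Hypothetical stand-in for a queued task: a bitmask of CPUs it may run on. */
struct model_task {
	uint64_t allowed_cpus;
};

/*
 * Number of entries @cpu inspects in a FIFO queue of @n tasks before it
 * finds one it is allowed to run. The matching entry is counted. If no
 * entry matches, the whole queue was inspected and @n is returned.
 */
size_t scan_cost(const struct model_task *q, size_t n, unsigned int cpu)
{
	size_t inspected = 0;

	assert(cpu < MODEL_MAX_CPUS);

	for (size_t i = 0; i < n; i++) {
		inspected++;
		if (q[i].allowed_cpus & (UINT64_C(1) << cpu))
			break;
	}
	return inspected;
}

/*
 * One shared queue holding one task pinned to each CPU, in CPU order.
 * Every CPU scans the same snapshot, so CPU c inspects c + 1 entries.
 */
size_t shared_queue_total_cost(unsigned int ncpus)
{
	struct model_task q[MODEL_MAX_CPUS];
	size_t total = 0;

	if (ncpus > MODEL_MAX_CPUS)
		ncpus = MODEL_MAX_CPUS;

	for (unsigned int cpu = 0; cpu < ncpus; cpu++)
		q[cpu].allowed_cpus = UINT64_C(1) << cpu;

	for (unsigned int cpu = 0; cpu < ncpus; cpu++)
		total += scan_cost(q, ncpus, cpu);

	return total;
}

/*
 * One queue per CPU, each holding only the task pinned to that CPU.
 * Every CPU finds its task at the head of its own queue.
 */
size_t per_cpu_queue_total_cost(unsigned int ncpus)
{
	size_t total = 0;

	if (ncpus > MODEL_MAX_CPUS)
		ncpus = MODEL_MAX_CPUS;

	for (unsigned int cpu = 0; cpu < ncpus; cpu++) {
		struct model_task q = { .allowed_cpus = UINT64_C(1) << cpu };

		total += scan_cost(&q, 1, cpu);
	}
	return total;
}
```

In this model, `shared_queue_total_cost(32)` returns 528 while `per_cpu_queue_total_cost(32)` returns 32, so the shared queue grows quadratically with the CPU count and the per-CPU queues grow linearly.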