Message-ID: <20251109183112.2412147-5-tj@kernel.org>
Date: Sun,  9 Nov 2025 08:31:03 -1000
From: Tejun Heo <tj@...nel.org>
To: David Vernet <void@...ifault.com>,
	Andrea Righi <andrea.righi@...ux.dev>,
	Changwoo Min <changwoo@...lia.com>
Cc: Dan Schatzberg <schatzberg.dan@...il.com>,
	Emil Tsalapatis <etsal@...a.com>,
	sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	Tejun Heo <tj@...nel.org>
Subject: [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode

When bypass mode is activated, tasks are routed through a fallback dispatch
queue instead of the BPF scheduler. Originally, bypass mode used a single
global DSQ, but this didn't scale well on NUMA machines and could lead to
livelocks. In b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"),
this was changed to use per-node global DSQs, which resolved the
cross-node-related livelocks.

However, Dan Schatzberg found that per-node global DSQs can also livelock in a
different scenario: on a NUMA node with many CPUs and many threads pinned to
different small subsets of those CPUs, each CPU often has to scan past many
tasks it cannot run to find the one task it can. With a high CPU count, this
scanning overhead can easily cause livelocks.
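
To make the scan cost concrete, here is a minimal sketch of the problematic
pattern. Illustration only: pick_from_shared_dsq is a made-up name, and this
is heavily simplified from the real consume path (which, among other things,
also has to skip BPF iterator cursors on the list).

static struct task_struct *
pick_from_shared_dsq(struct scx_dispatch_q *dsq, int cpu)
{
	struct task_struct *p;

	/*
	 * O(nr_queued) per pick: with many pinned tasks, most
	 * iterations are skips, and every CPU repeats this walk under
	 * the shared DSQ lock.
	 */
	list_for_each_entry(p, &dsq->list, scx.dsq_list.node)
		if (cpumask_test_cpu(cpu, p->cpus_ptr))
			return p;
	return NULL;
}

With many pinned tasks per node, each pick degenerates into a long serialized
walk, and the walks themselves keep CPUs too busy to make forward progress.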

Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
on the CPU that it's currently on. Because the default idle CPU selection
policy and direct dispatch are both active during bypass, this works well in
most cases, including the scenario above.
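
For contrast, a minimal sketch of the per-CPU pick during bypass
(pick_from_bypass_dsq is a made-up name, simplified from the real dispatch
path):

static struct task_struct *pick_from_bypass_dsq(struct rq *rq)
{
	/*
	 * Each CPU drains only its own bypass DSQ, so there is nothing
	 * to skip over: tasks here were queued from this CPU and its
	 * affinity typically already allows them.
	 */
	return list_first_entry_or_null(&rq->scx.bypass_dsq.list,
					struct task_struct,
					scx.dsq_list.node);
}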

However, this does have a failure mode on highly over-saturated systems where
tasks are concentrated on a single CPU. If the BPF scheduler places most tasks
on one CPU and then triggers bypass mode, bypass mode will keep those tasks on
that one CPU, which can lead to failures such as RCU stalls because the queue
may be too long for that CPU to drain in a reasonable time. This will be
addressed with a load balancer in a future patch. The bypass DSQ is kept
separate from the local DSQ so that the load balancer can move tasks between
bypass DSQs.
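
To sketch why the separation matters: a balancer can lock a remote CPU's
bypass DSQ and look for tasks to pull, which would be awkward against the
local DSQ. Hypothetical illustration only - bypass_steal_task is a made-up
name, the dequeue/migration machinery is omitted, and the future patch may
look entirely different:

static struct task_struct *bypass_steal_task(struct rq *dst_rq,
					     struct rq *src_rq)
{
	struct scx_dispatch_q *src_dsq = &src_rq->scx.bypass_dsq;
	struct task_struct *p, *found = NULL;

	raw_spin_lock(&src_dsq->lock);
	list_for_each_entry(p, &src_dsq->list, scx.dsq_list.node) {
		/* only consider tasks allowed on the destination CPU */
		if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) {
			found = p;
			break;
		}
	}
	raw_spin_unlock(&src_dsq->lock);

	/* a real implementation would dequeue and migrate @found under
	 * the appropriate rq locks before the DSQ lock is dropped */
	return found;
}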

Reported-by: Dan Schatzberg <schatzberg.dan@...il.com>
Cc: Emil Tsalapatis <etsal@...a.com>
Signed-off-by: Tejun Heo <tj@...nel.org>
---
 include/linux/sched/ext.h |  1 +
 kernel/sched/ext.c        | 16 +++++++++++++---
 kernel/sched/sched.h      |  1 +
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 9f5b0f2be310..e1502faf6241 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -57,6 +57,7 @@ enum scx_dsq_id_flags {
 	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
 	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
 	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
+	SCX_DSQ_BYPASS		= SCX_DSQ_FLAG_BUILTIN | 3,
 	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
 	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
 };
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a29bfadde89d..4b8b91494947 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1301,7 +1301,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 
 	if (scx_rq_bypassing(rq)) {
 		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
-		goto global;
+		goto bypass;
 	}
 
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
@@ -1359,6 +1359,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 global:
 	dsq = find_global_dsq(sch, p);
 	goto enqueue;
+bypass:
+	dsq = &task_rq(p)->scx.bypass_dsq;
+	goto enqueue;
 
 enqueue:
 	/*
@@ -2157,8 +2160,14 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	if (consume_global_dsq(sch, rq))
 		goto has_tasks;
 
-	if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
-	    scx_rq_bypassing(rq) || !scx_rq_online(rq))
+	if (scx_rq_bypassing(rq)) {
+		if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
+			goto has_tasks;
+		else
+			goto no_tasks;
+	}
+
+	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
 		goto no_tasks;
 
 	dspc->rq = rq;
@@ -5370,6 +5379,7 @@ void __init init_sched_ext_class(void)
 		int n = cpu_to_node(cpu);
 
 		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+		init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
 		INIT_LIST_HEAD(&rq->scx.runnable_list);
 		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27aae2a298f8..5991133a4849 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -808,6 +808,7 @@ struct scx_rq {
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
+	struct scx_dispatch_q	bypass_dsq;
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
 
-- 
2.51.1

