linux-kernel - Re: [PATCH v2 04/14] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode


Message-ID: <CABFh=a6rP08+vsK11Ubi5qv2o2yOYFSoiMMM8ZahSVy=LzXpow@mail.gmail.com>
Date: Mon, 10 Nov 2025 16:43:23 -0500
From: Emil Tsalapatis <linux-lists@...alapatis.com>
To: Tejun Heo <tj@...nel.org>
Cc: David Vernet <void@...ifault.com>, Andrea Righi <andrea.righi@...ux.dev>, 
	Changwoo Min <changwoo@...lia.com>, Dan Schatzberg <schatzberg.dan@...il.com>, 
	Emil Tsalapatis <etsal@...a.com>, sched-ext@...ts.linux.dev, linux-kernel@...r.kernel.org, 
	Andrea Righi <arighi@...dia.com>
Subject: Re: [PATCH v2 04/14] sched_ext: Use per-CPU DSQs instead of per-node
 global DSQs in bypass mode

On Mon, Nov 10, 2025 at 3:56 PM Tejun Heo <tj@...nel.org> wrote:
>
> Bypass mode routes tasks through fallback dispatch queues. Originally a single
> global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node")
> changed this to per-node DSQs to resolve NUMA-related livelocks.
>
> Dan Schatzberg found per-node DSQs can still livelock when many threads are
> pinned to different small CPU subsets: each CPU must scan many incompatible
> tasks to find runnable ones, causing severe contention with high CPU counts.
>
> Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
> idle CPU selection and direct dispatch handle most cases well.
>
> This introduces a failure mode when tasks concentrate on one CPU in
> over-saturated systems. If the BPF scheduler severely skews placement before
> triggering bypass, that CPU's queue may be too long to drain, causing RCU
> stalls. A load balancer in a future patch will address this. The bypass DSQ is
> separate from local DSQ to enable load balancing: local DSQs use rq locks,
> preventing efficient scanning and transfer across CPUs, especially problematic
> when systems are already contended.
>
> v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).
>
> Reported-by: Dan Schatzberg <schatzberg.dan@...il.com>
> Cc: Emil Tsalapatis <etsal@...a.com>
> Reviewed-by: Andrea Righi <arighi@...dia.com>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> ---

Reviewed-by: Emil Tsalapatis <emil@...alapatis.com>

>  include/linux/sched/ext.h |  1 +
>  kernel/sched/ext.c        | 16 +++++++++++++---
>  kernel/sched/sched.h      |  1 +
>  3 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 60285c3d07cf..3d3216ff9188 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -57,6 +57,7 @@ enum scx_dsq_id_flags {
>         SCX_DSQ_INVALID         = SCX_DSQ_FLAG_BUILTIN | 0,
>         SCX_DSQ_GLOBAL          = SCX_DSQ_FLAG_BUILTIN | 1,
>         SCX_DSQ_LOCAL           = SCX_DSQ_FLAG_BUILTIN | 2,
> +       SCX_DSQ_BYPASS          = SCX_DSQ_FLAG_BUILTIN | 3,
>         SCX_DSQ_LOCAL_ON        = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
>         SCX_DSQ_LOCAL_CPU_MASK  = 0xffffffffLLU,
>  };
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index b18864655d3a..4e128b139e7c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1298,7 +1298,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>
>         if (scx_rq_bypassing(rq)) {
>                 __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);

Nit: The bypass label holds a single assignment, and nothing falls
through to it. Can we inline the logic here:

dsq = &task_rq(p)->scx.bypass_dsq;
goto enqueue;

and remove the new label?
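
To make the comparison concrete, here is a minimal userspace sketch of the
two control-flow shapes. It is not kernel code: the toy_* types and both
function names are made up for illustration, the non-bypass path is
collapsed to a single global queue, and task_rq(p) is replaced by the rq
argument. It only shows that inlining the assignment picks the same queue
as jumping to a dedicated label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct toy_dsq {
	int nr_queued;
};

struct toy_rq {
	bool bypassing;
	struct toy_dsq global_dsq;	/* stands in for find_global_dsq() */
	struct toy_dsq bypass_dsq;	/* per-CPU queue added by the patch */
};

/* Variant A: shape of the posted patch, with a separate bypass label. */
static struct toy_dsq *pick_dsq_with_label(struct toy_rq *rq)
{
	struct toy_dsq *dsq;

	assert(rq != NULL);
	if (rq->bypassing)
		goto bypass;
	dsq = &rq->global_dsq;
	goto enqueue;
bypass:
	dsq = &rq->bypass_dsq;
	goto enqueue;
enqueue:
	dsq->nr_queued++;
	return dsq;
}

/* Variant B: suggested shape, assignment inlined at the branch. */
static struct toy_dsq *pick_dsq_inlined(struct toy_rq *rq)
{
	struct toy_dsq *dsq;

	assert(rq != NULL);
	if (rq->bypassing) {
		dsq = &rq->bypass_dsq;
		goto enqueue;
	}
	dsq = &rq->global_dsq;
enqueue:
	dsq->nr_queued++;
	return dsq;
}
```

Both variants select the bypass queue when the rq is bypassing; Variant B
just has one label fewer to follow.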

> -               goto global;
> +               goto bypass;
>         }
>
>         if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> @@ -1356,6 +1356,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  global:
>         dsq = find_global_dsq(sch, p);
>         goto enqueue;
> +bypass:
> +       dsq = &task_rq(p)->scx.bypass_dsq;

Nit: If we keep the bypass label, we can drop this goto, since the
enqueue label immediately follows and execution falls through to it.
Otherwise, the label can be removed as suggested above.

> +       goto enqueue;
>
>  enqueue:
>         /*
> @@ -2154,8 +2157,14 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>         if (consume_global_dsq(sch, rq))
>                 goto has_tasks;
>
> -       if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
> -           scx_rq_bypassing(rq) || !scx_rq_online(rq))
> +       if (scx_rq_bypassing(rq)) {
> +               if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
> +                       goto has_tasks;
> +               else
> +                       goto no_tasks;
> +       }
> +
> +       if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
>                 goto no_tasks;
>
>         dspc->rq = rq;
> @@ -5367,6 +5376,7 @@ void __init init_sched_ext_class(void)
>                 int  n = cpu_to_node(cpu);
>
>                 init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
> +               init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
>                 INIT_LIST_HEAD(&rq->scx.runnable_list);
>                 INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 27aae2a298f8..5991133a4849 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -808,6 +808,7 @@ struct scx_rq {
>         struct balance_callback deferred_bal_cb;
>         struct irq_work         deferred_irq_work;
>         struct irq_work         kick_cpus_irq_work;
> +       struct scx_dispatch_q   bypass_dsq;
>  };
>  #endif /* CONFIG_SCHED_CLASS_EXT */
>
> --
> 2.51.2
>
>
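
For readers following the commit message, a toy userspace model of the
scanning cost it describes may help. This is only a sketch under stated
assumptions: struct toy_task and both helpers are hypothetical, the shared
queue is a flat array walked in order, and each task's affinity mask is
assumed to include the CPU it currently sits on. It counts how many queue
entries a CPU has to inspect before it finds a task it may run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task descriptor; not the kernel's task_struct. */
struct toy_task {
	int cpu;		/* CPU the task currently sits on */
	unsigned allowed;	/* bitmask of CPUs the task may run on */
};

/* Shared queue: entries @cpu inspects before finding a runnable task. */
static size_t toy_scan_shared(const struct toy_task *q, size_t n, int cpu)
{
	size_t inspected = 0;

	for (size_t i = 0; i < n; i++) {
		inspected++;
		if (q[i].allowed & (1u << cpu))
			break;
	}
	return inspected;
}

/* Per-CPU queues: @cpu only ever sees tasks that were queued on it. */
static size_t toy_scan_percpu(const struct toy_task *q, size_t n, int cpu)
{
	for (size_t i = 0; i < n; i++) {
		if (q[i].cpu == cpu) {
			/* Assumed: affinity includes the CPU the task sits on. */
			assert(q[i].allowed & (1u << cpu));
			return 1;	/* head of this CPU's queue is runnable */
		}
	}
	return 0;	/* this CPU's queue is empty */
}
```

With three tasks pinned to CPUs 0, 0 and 1 queued ahead of one task pinned
to CPU 3, CPU 3 inspects four entries in the shared model and one in the
per-CPU model. The gap grows with the number of incompatible tasks queued
ahead, which is the contention the commit message attributes to the
per-node DSQs.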