Message-ID: <978215f1-40d1-4c77-b436-2710029d5acc@igalia.com>
Date: Mon, 15 Sep 2025 16:22:20 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Andrea Righi <arighi@...dia.com>, Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>
Cc: Cheng-Yang Chou <yphbchou0911@...il.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Revert "sched_ext: Skip per-CPU tasks in
scx_bpf_reenqueue_local()"
Hi Andrea,
On 9/13/25 01:14, Andrea Righi wrote:
> scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
> CPU is taken by a higher scheduling class to give tasks queued to the
> CPU's local DSQ a chance to be migrated somewhere else, instead of
> waiting indefinitely for that CPU to become available again.
>
> In doing so, we decided to skip migration-disabled tasks, under the
> assumption that they cannot be migrated anyway.
>
> However, when a higher scheduling class preempts a CPU, the running task
> is always inserted at the head of the local DSQ as a migration-disabled
> task. This means it is always skipped by scx_bpf_reenqueue_local(), and
> ends up being confined to the same CPU even if that CPU is heavily
> contended by other higher scheduling class tasks.
>
> As an example, let's consider the following scenario:
>
> $ schedtool -a 0,1, -e yes > /dev/null
> $ sudo schedtool -F -p 99 -a 0, -e \
> stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000
>
> The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
> (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
> task initially runs on CPU0, it will remain there because it always sees
> CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
> utilization while CPU1 stays idle:
>
> 0[||||||||||||||||||||||100.0%] 8[ 0.0%]
> 1[ 0.0%] 9[ 0.0%]
> 2[ 0.0%] 10[ 0.0%]
> 3[ 0.0%] 11[ 0.0%]
> 4[ 0.0%] 12[ 0.0%]
> 5[ 0.0%] 13[ 0.0%]
> 6[ 0.0%] 14[ 0.0%]
> 7[ 0.0%] 15[ 0.0%]
> PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
> 1067 root RT 0 R 0 99.0 0.2 0:31.16 stress-ng-cpu [run]
> 975 arighi 20 0 R 0 1.0 0.0 0:26.32 yes
>
> By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
> tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
> this case) via ops.enqueue(), leading to better CPU utilization:
>
> 0[||||||||||||||||||||||100.0%] 8[ 0.0%]
> 1[||||||||||||||||||||||100.0%] 9[ 0.0%]
> 2[ 0.0%] 10[ 0.0%]
> 3[ 0.0%] 11[ 0.0%]
> 4[ 0.0%] 12[ 0.0%]
> 5[ 0.0%] 13[ 0.0%]
> 6[ 0.0%] 14[ 0.0%]
> 7[ 0.0%] 15[ 0.0%]
> PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
> 577 root RT 0 R 0 100.0 0.2 0:23.17 stress-ng-cpu [run]
> 555 arighi 20 0 R 1 100.0 0.0 0:28.67 yes
>
> It's debatable whether per-CPU tasks should be re-enqueued as well, but
> doing so is probably safer: the scheduler can recognize re-enqueued
> tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
> either put them back at the head of the local DSQ or let another task
> attempt to take the CPU.
>
> This also prevents giving per-CPU tasks an implicit priority boost,
> which would otherwise make them more likely to reclaim CPUs preempted by
> higher scheduling classes.
>
> Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
> Cc: stable@...r.kernel.org # v6.15+
> Signed-off-by: Andrea Righi <arighi@...dia.com>
> ---
> kernel/sched/ext.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 236dce2fc13b4..4c3592e26ee45 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5726,12 +5726,8 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
> * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
> * the current local DSQ for running tasks and thus are not
> * visible to the BPF scheduler.
> - *
> - * Also skip re-enqueueing tasks that can only run on this
> - * CPU, as they would just be re-added to the same local
> - * DSQ without any benefit.
> */
> - if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
> + if (p->migration_pending)
> continue;
I think it is okay to keep "p->nr_cpus_allowed == 1" in the
condition. When a task is pinned to a *single* CPU, there is no
other place for the scheduler to put it, so re-enqueueing it has
no benefit. Keeping that check would also not break your example,
since the first task is allowed to run on either CPU 0 or CPU 1.
Regards,
Changwoo Min