Message-ID: <978215f1-40d1-4c77-b436-2710029d5acc@igalia.com>
Date: Mon, 15 Sep 2025 16:22:20 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Andrea Righi <arighi@...dia.com>, Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>
Cc: Cheng-Yang Chou <yphbchou0911@...il.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Revert "sched_ext: Skip per-CPU tasks in
scx_bpf_reenqueue_local()"
Hi Andrea,
On 9/13/25 01:14, Andrea Righi wrote:
> scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
> CPU is taken by a higher scheduling class to give tasks queued to the
> CPU's local DSQ a chance to be migrated somewhere else, instead of
> waiting indefinitely for that CPU to become available again.
>
> In doing so, we decided to skip migration-disabled tasks, under the
> assumption that they cannot be migrated anyway.
>
> However, when a higher scheduling class preempts a CPU, the running task
> is always inserted at the head of the local DSQ as a migration-disabled
> task. This means it is always skipped by scx_bpf_reenqueue_local(), and
> ends up being confined to the same CPU even if that CPU is heavily
> contended by other higher scheduling class tasks.
>
> As an example, let's consider the following scenario:
>
> $ schedtool -a 0,1, -e yes > /dev/null
> $ sudo schedtool -F -p 99 -a 0, -e \
> stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000
>
> The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
> (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
> task initially runs on CPU0, it will remain there because it always sees
> CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
> utilization while CPU1 stays idle:
>
> 0[||||||||||||||||||||||100.0%] 8[ 0.0%]
> 1[ 0.0%] 9[ 0.0%]
> 2[ 0.0%] 10[ 0.0%]
> 3[ 0.0%] 11[ 0.0%]
> 4[ 0.0%] 12[ 0.0%]
> 5[ 0.0%] 13[ 0.0%]
> 6[ 0.0%] 14[ 0.0%]
> 7[ 0.0%] 15[ 0.0%]
> PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
> 1067 root RT 0 R 0 99.0 0.2 0:31.16 stress-ng-cpu [run]
> 975 arighi 20 0 R 0 1.0 0.0 0:26.32 yes
>
> By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
> tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
> this case) via ops.enqueue(), leading to better CPU utilization:
>
> 0[||||||||||||||||||||||100.0%] 8[ 0.0%]
> 1[||||||||||||||||||||||100.0%] 9[ 0.0%]
> 2[ 0.0%] 10[ 0.0%]
> 3[ 0.0%] 11[ 0.0%]
> 4[ 0.0%] 12[ 0.0%]
> 5[ 0.0%] 13[ 0.0%]
> 6[ 0.0%] 14[ 0.0%]
> 7[ 0.0%] 15[ 0.0%]
> PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
> 577 root RT 0 R 0 100.0 0.2 0:23.17 stress-ng-cpu [run]
> 555 arighi 20 0 R 1 100.0 0.0 0:28.67 yes
>
> It's debatable whether per-CPU tasks should be re-enqueued as well, but
> doing so is probably safer: the scheduler can recognize re-enqueued
> tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
> either put them back at the head of the local DSQ or let another task
> attempt to take the CPU.
>
> This also prevents giving per-CPU tasks an implicit priority boost,
> which would otherwise make them more likely to reclaim CPUs preempted by
> higher scheduling classes.
>
> Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
> Cc: stable@...r.kernel.org # v6.15+
> Signed-off-by: Andrea Righi <arighi@...dia.com>
> ---
> kernel/sched/ext.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 236dce2fc13b4..4c3592e26ee45 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5726,12 +5726,8 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
> * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
> * the current local DSQ for running tasks and thus are not
> * visible to the BPF scheduler.
> - *
> - * Also skip re-enqueueing tasks that can only run on this
> - * CPU, as they would just be re-added to the same local
> - * DSQ without any benefit.
> */
> - if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
> + if (p->migration_pending)
> continue;
I think it is okay to keep "p->nr_cpus_allowed == 1" in the
condition. When a task is pinned to a *single* CPU, there is no
other place for the scheduler to put it, so re-enqueueing it has
no benefit. Keeping that check would also not break your example,
since the first task is allowed to run on either CPU 0 or CPU 1.
Regards,
Changwoo Min