Message-ID: <aMfC7sxzVYbRKYps@gpd4>
Date: Mon, 15 Sep 2025 09:40:30 +0200
From: Andrea Righi <arighi@...dia.com>
To: Changwoo Min <changwoo@...lia.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
	Cheng-Yang Chou <yphbchou0911@...il.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Revert "sched_ext: Skip per-CPU tasks in
 scx_bpf_reenqueue_local()"

Hi Changwoo,

On Mon, Sep 15, 2025 at 04:22:20PM +0900, Changwoo Min wrote:
> Hi Andrea,
> 
> On 9/13/25 01:14, Andrea Righi wrote:
> > scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
> > CPU is taken by a higher scheduling class to give tasks queued to the
> > CPU's local DSQ a chance to be migrated somewhere else, instead of
> > waiting indefinitely for that CPU to become available again.
> > 
> > In doing so, we decided to skip migration-disabled tasks, under the
> > assumption that they cannot be migrated anyway.
> > 
> > However, when a higher scheduling class preempts a CPU, the running task
> > is always inserted at the head of the local DSQ as a migration-disabled
> > task. This means it is always skipped by scx_bpf_reenqueue_local(), and
> > ends up being confined to the same CPU even if that CPU is heavily
> > contended by other higher scheduling class tasks.
> > 
> > As an example, let's consider the following scenario:
> > 
> >   $ schedtool -a 0,1, -e yes > /dev/null
> >   $ sudo schedtool -F -p 99 -a 0, -e \
> >     stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000
> > 
> > The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
> > (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
> > task initially runs on CPU0, it will remain there because it always sees
> > CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
> > utilization while CPU1 stays idle:
> > 
> >      0[||||||||||||||||||||||100.0%]   8[                        0.0%]
> >      1[                        0.0%]   9[                        0.0%]
> >      2[                        0.0%]  10[                        0.0%]
> >      3[                        0.0%]  11[                        0.0%]
> >      4[                        0.0%]  12[                        0.0%]
> >      5[                        0.0%]  13[                        0.0%]
> >      6[                        0.0%]  14[                        0.0%]
> >      7[                        0.0%]  15[                        0.0%]
> >    PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
> >   1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
> >    975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes
> > 
> > By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
> > tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
> > this case) via ops.enqueue(), leading to better CPU utilization:
> > 
> >      0[||||||||||||||||||||||100.0%]   8[                        0.0%]
> >      1[||||||||||||||||||||||100.0%]   9[                        0.0%]
> >      2[                        0.0%]  10[                        0.0%]
> >      3[                        0.0%]  11[                        0.0%]
> >      4[                        0.0%]  12[                        0.0%]
> >      5[                        0.0%]  13[                        0.0%]
> >      6[                        0.0%]  14[                        0.0%]
> >      7[                        0.0%]  15[                        0.0%]
> >    PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
> >    577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
> >    555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes
> > 
> > It's debatable whether per-CPU tasks should be re-enqueued as well, but
> > doing so is probably safer: the scheduler can recognize re-enqueued
> > tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
> > either put them back at the head of the local DSQ or let another task
> > attempt to take the CPU.
> > 
> > This also prevents giving per-CPU tasks an implicit priority boost,
> > which would otherwise make them more likely to reclaim CPUs preempted by
> > higher scheduling classes.
> > 
> > Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
> > Cc: stable@...r.kernel.org # v6.15+
> > Signed-off-by: Andrea Righi <arighi@...dia.com>
> > ---
> >   kernel/sched/ext.c | 6 +-----
> >   1 file changed, 1 insertion(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 236dce2fc13b4..4c3592e26ee45 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -5726,12 +5726,8 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
> >   		 * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
> >   		 * the current local DSQ for running tasks and thus are not
> >   		 * visible to the BPF scheduler.
> > -		 *
> > -		 * Also skip re-enqueueing tasks that can only run on this
> > -		 * CPU, as they would just be re-added to the same local
> > -		 * DSQ without any benefit.
> >   		 */
> > -		if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
> > +		if (p->migration_pending)
> >   			continue;
> 
> I think it is okay to keep "p->nr_cpus_allowed == 1" in the
> condition. When a task is pinned to a *single* CPU, there is no
> other place for a scheduler to put it. Also, keeping the condition
> still works in your example, since the first task can run on either
> CPU 0 or 1.

Yeah, I was also conflicted about whether to keep `nr_cpus_allowed == 1` or
not. The main reason I lean toward re-enqueuing all tasks is that some
schedulers may want to re-evaluate other task properties when a CPU is
stolen, even if the target CPU doesn't change.
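
For example, a scheduler opting into this behavior only needs a trivial
ops.cpu_release() hook; just a sketch, assuming the usual
scx/common.bpf.h boilerplate and with the "my_" names made up:

  void BPF_STRUCT_OPS(my_cpu_release, s32 cpu,
                      struct scx_cpu_release_args *args)
  {
          /*
           * The CPU was taken by a higher scheduling class: bounce
           * everything in its local DSQ back through ops.enqueue(),
           * which will see those tasks again with %SCX_ENQ_REENQ set.
           */
          scx_bpf_reenqueue_local();
  }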

For instance, a scheduler that adjusts time slices based on the number
of waiting tasks may want to re-compute the assigned time slice after a
preemption from a higher scheduling class, even if the task is bound to
the same CPU.
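
Something along these lines, as a purely hypothetical sketch (nr_waiting
is a made-up counter the scheduler would maintain elsewhere, and the
slice scaling policy is arbitrary, but scx_bpf_dsq_insert() and
%SCX_ENQ_REENQ are the actual kfunc and flag):

  static u64 nr_waiting; /* hypothetical, maintained by the scheduler */

  void BPF_STRUCT_OPS(my_enqueue, struct task_struct *p, u64 enq_flags)
  {
          u64 slice = SCX_SLICE_DFL;

          /*
           * Task bounced back from a preempted CPU: re-scale its slice
           * based on the current load, even if its placement can't
           * change (p->nr_cpus_allowed == 1). A real scheduler would
           * also reconsider the target CPU/DSQ here.
           */
          if (enq_flags & SCX_ENQ_REENQ)
                  slice = SCX_SLICE_DFL / (nr_waiting + 1);

          scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice, enq_flags);
  }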

There's also the fact that the running task is added back to the head of
the local DSQ on preemption from a higher scheduling class. If it's a
per-CPU task with a long time slice assigned and we don't re-enqueue it, it
may block more critical tasks that have arrived in the meantime.

Overall, I think the re-enqueue cost is small and it keeps the door open
for more flexible policies. What do you think?

Thanks,
-Andrea
