Message-ID: <73e0c4f7-7a47-4ecb-b9b6-cd20ac982ff3@igalia.com>
Date: Tue, 16 Sep 2025 11:05:50 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Andrea Righi <arighi@...dia.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
Cheng-Yang Chou <yphbchou0911@...il.com>, sched-ext@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Revert "sched_ext: Skip per-CPU tasks in
scx_bpf_reenqueue_local()"
Hi Andrea,
On 9/15/25 16:40, Andrea Righi wrote:
> Hi Changwoo,
>
> On Mon, Sep 15, 2025 at 04:22:20PM +0900, Changwoo Min wrote:
>> Hi Andrea,
>>
>> On 9/13/25 01:14, Andrea Righi wrote:
>>> scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
>>> CPU is taken by a higher scheduling class to give tasks queued to the
>>> CPU's local DSQ a chance to be migrated somewhere else, instead of
>>> waiting indefinitely for that CPU to become available again.
>>>
>>> In doing so, we decided to skip migration-disabled tasks, under the
>>> assumption that they cannot be migrated anyway.
>>>
>>> However, when a higher scheduling class preempts a CPU, the running task
>>> is always inserted at the head of the local DSQ as a migration-disabled
>>> task. This means it is always skipped by scx_bpf_reenqueue_local(), and
>>> ends up being confined to the same CPU even if that CPU is heavily
>>> contended by other higher scheduling class tasks.
>>>
>>> As an example, let's consider the following scenario:
>>>
>>> $ schedtool -a 0,1, -e yes > /dev/null
>>> $ sudo schedtool -F -p 99 -a 0, -e \
>>> stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000
>>>
>>> The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
>>> (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
>>> task initially runs on CPU0, it will remain there because it always sees
>>> CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
>>> utilization while CPU1 stays idle:
>>>
>>> 0[||||||||||||||||||||||100.0%] 8[ 0.0%]
>>> 1[ 0.0%] 9[ 0.0%]
>>> 2[ 0.0%] 10[ 0.0%]
>>> 3[ 0.0%] 11[ 0.0%]
>>> 4[ 0.0%] 12[ 0.0%]
>>> 5[ 0.0%] 13[ 0.0%]
>>> 6[ 0.0%] 14[ 0.0%]
>>> 7[ 0.0%] 15[ 0.0%]
>>> PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
>>> 1067 root RT 0 R 0 99.0 0.2 0:31.16 stress-ng-cpu [run]
>>> 975 arighi 20 0 R 0 1.0 0.0 0:26.32 yes
>>>
>>> By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
>>> tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
>>> this case) via ops.enqueue(), leading to better CPU utilization:
>>>
>>> 0[||||||||||||||||||||||100.0%] 8[ 0.0%]
>>> 1[||||||||||||||||||||||100.0%] 9[ 0.0%]
>>> 2[ 0.0%] 10[ 0.0%]
>>> 3[ 0.0%] 11[ 0.0%]
>>> 4[ 0.0%] 12[ 0.0%]
>>> 5[ 0.0%] 13[ 0.0%]
>>> 6[ 0.0%] 14[ 0.0%]
>>> 7[ 0.0%] 15[ 0.0%]
>>> PID USER PRI NI S CPU CPU%▽MEM% TIME+ Command
>>> 577 root RT 0 R 0 100.0 0.2 0:23.17 stress-ng-cpu [run]
>>> 555 arighi 20 0 R 1 100.0 0.0 0:28.67 yes
>>>
>>> It's debatable whether per-CPU tasks should be re-enqueued as well, but
>>> doing so is probably safer: the scheduler can recognize re-enqueued
>>> tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
>>> either put them back at the head of the local DSQ or let another task
>>> attempt to take the CPU.
>>>
>>> This also prevents giving per-CPU tasks an implicit priority boost,
>>> which would otherwise make them more likely to reclaim CPUs preempted by
>>> higher scheduling classes.
>>>
>>> Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
>>> Cc: stable@...r.kernel.org # v6.15+
>>> Signed-off-by: Andrea Righi <arighi@...dia.com>
>>> ---
>>> kernel/sched/ext.c | 6 +-----
>>> 1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>>> index 236dce2fc13b4..4c3592e26ee45 100644
>>> --- a/kernel/sched/ext.c
>>> +++ b/kernel/sched/ext.c
>>> @@ -5726,12 +5726,8 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
>>> * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
>>> * the current local DSQ for running tasks and thus are not
>>> * visible to the BPF scheduler.
>>> - *
>>> - * Also skip re-enqueueing tasks that can only run on this
>>> - * CPU, as they would just be re-added to the same local
>>> - * DSQ without any benefit.
>>> */
>>> - if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
>>> + if (p->migration_pending)
>>> continue;
>>
>> I think it is okay to keep "p->nr_cpus_allowed == 1" in the
>> condition. When a task is pinned to a *single* CPU, there is no
>> other place for a scheduler to put it. Keeping the check is also
>> compatible with your example, where the first task can run on
>> either CPU 0 or CPU 1.
>
> Yeah, I was also conflicted about whether to keep `nr_cpus_allowed == 1` or
> not. The main reason I lean toward re-enqueuing all tasks is that some
> schedulers may want to re-evaluate other task properties when a CPU is
> stolen, even if the target CPU doesn't change.
>
> For instance, a scheduler that adjusts time slices as a function of the
> number of waiting tasks may want to re-evaluate the assigned time slice
> after a preemption from a higher scheduling class, even if the task is
> bound to the same CPU.
>
> There's also the fact that the running task is added back to the head of
> the local DSQ on preemption from a higher scheduling class. If it's a
> per-CPU task with a long time slice assigned and we don't re-enqueue it, it
> may block more critical tasks that have arrived in the meantime.
>
> Overall, I think the re-enqueue cost is small and it keeps the door open
> for more flexible policies. What do you think?
That makes sense. A BPF scheduler might want to readjust its time slice
according to the current load.
Acked-by: Changwoo Min <changwoo@...lia.com>
Regards,
Changwoo Min