Message-ID: <c0174dd7-86f5-4f4d-b0eb-dd60515e21c5@arm.com>
Date: Wed, 20 Aug 2025 14:53:36 +0100
From: Christian Loehle <christian.loehle@....com>
To: "Chen, Yu C" <yu.c.chen@...el.com>,
Chengming Zhou <chengming.zhou@...ux.dev>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, bsegall@...gle.com,
vschneid@...hat.com, juri.lelli@...hat.com, rostedt@...dmis.org,
mgorman@...e.de, dietmar.eggemann@....com, vincent.guittot@...aro.org,
peterz@...radead.org
Subject: Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in
select_task_rq_fair()

On 8/19/25 16:32, Chen, Yu C wrote:
> On 8/18/2025 9:24 PM, Christian Loehle wrote:
>> On 8/18/25 13:47, Chengming Zhou wrote:
>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
>>> on the assumption that the wakee task can pick a cpu running a sched_idle
>>> task and preempt it to run, faster than picking an idle cpu to preempt
>>> the idle task.
>>>
>>> This assumption is correct, but it also brings some problems:
>>>
>>> 1. work conservation: Often sched_idle tasks are also picking the cpu
>>> which is already running sched_idle task, instead of utilizing a real
>>> idle cpu, so work conservation is somewhat broken.
>>>
>>> 2. sched_idle group: This sched_idle_cpu() is just not correct with
>>> a sched_idle group running. Look at a simple example below.
>>>
>>>               root
>>>              /    \
>>>       kubepods    system
>>>       /      \
>>> burstable  besteffort
>>>            (cpu.idle == 1)
>>>
>>> When a sched_idle cpu is just running tasks from besteffort group,
>>> sched_idle_cpu() will return true in this case, but this cpu pick
>>> is bad for wakee task from system group. Because the system group
>>> has lower weight than kubepods, work conservation is somewhat
>>> broken too.
>>>
>>> In a nutshell, sched_idle_cpu() should consider the wakee task group's
>>> relationship with sched_idle tasks running on the cpu.
>>>
>>> Obviously, it's hard to do so. This patch chooses the simple approach
>>> to remove all sched_idle_cpu() considerations in select_task_rq_fair()
>>> to bring back work conservation in these cases.
>>
>> OTOH sched_idle_cpu() CPUs are guaranteed to not be in an idle state and
>> potentially already have DVFS on some higher level...
>>
> Is it because the schedutil governor considers the utilization
> of SCHED_IDLE, thus causing schedutil to request a higher
> frequency?

For intel_pstate active (HWP and !HWP) the same issue should persist, no?

>
> The commit 3c29e651e16d ("sched/fair: Fall back to sched-idle
> CPU if an idle CPU isn't found") mentions that choosing a CPU
> running a SCHED_IDLE task can avoid waking a CPU from a deep
> sleep state.
>
> If this is the case, can we say that if an administrator sets
> the cpufreq governor to "performance" and disables deep idle
> states, an idle CPU would be more preferable than a CPU running
> a SCHED_IDLE task? On the other hand, if
> per_cpu(cpufreq_update_util_data, cpu) is NULL and only shallow
> idle states are enabled in idle_get_state(), should we skip
> SCHED_IDLE to achieve work conservation?

That's probably getting the most out of it.

That being said, strictly speaking the SCHED_IDLE CPU and the
SHALLOW_IDLE CPU may still share a power and thermal budget, which
may make preempting the sched-idle task on the SCHED_IDLE CPU the
better choice.
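
For illustration only, the heuristic discussed above could be sketched as a
userspace toy model in C. This is not kernel code and not a proposed patch:
struct cpu_model, idle_wakeup_is_cheap() and pick_cpu() are hypothetical
names, and the two boolean fields merely stand in for the
per_cpu(cpufreq_update_util_data, cpu) and idle_get_state() checks
mentioned in the quoted text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CPU; these fields do not exist in the kernel. */
struct cpu_model {
	bool idle;              /* nothing runnable on this CPU */
	bool sched_idle_only;   /* only SCHED_IDLE tasks are runnable */
	bool has_freq_hook;     /* stand-in: cpufreq_update_util_data != NULL */
	bool deep_idle_enabled; /* stand-in: deep idle states are reachable */
};

/*
 * Assumption taken from the thread: waking an idle CPU is "cheap" when no
 * governor hook reacts to utilization and only shallow idle states are
 * enabled.
 */
static bool idle_wakeup_is_cheap(const struct cpu_model *cpu)
{
	assert(cpu != NULL);
	return !cpu->has_freq_hook && !cpu->deep_idle_enabled;
}

/*
 * Return the index of the preferred CPU, or -1 if no CPU is idle or
 * sched-idle. An idle CPU wins only if waking it is cheap or if there is
 * no sched-idle CPU to fall back to.
 */
static int pick_cpu(const struct cpu_model *cpus, size_t n)
{
	int idle = -1, sched_idle = -1;

	assert(cpus != NULL);
	for (size_t i = 0; i < n; i++) {
		if (cpus[i].idle) {
			if (idle < 0)
				idle = (int)i;
		} else if (cpus[i].sched_idle_only) {
			if (sched_idle < 0)
				sched_idle = (int)i;
		}
	}

	if (idle >= 0 && (sched_idle < 0 || idle_wakeup_is_cheap(&cpus[idle])))
		return idle;
	return sched_idle;
}
```

Note that the sketch deliberately ignores the shared power and thermal
budget caveat above, so it only models the work-conservation side of the
trade-off.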