Message-ID: <68e34465-ecb6-409e-800c-3dd354156bb0@linux.dev>
Date: Mon, 25 Aug 2025 14:58:35 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Josh Don <joshdon@...gle.com>
Cc: Christian Loehle <christian.loehle@....com>,
"Chen, Yu C" <yu.c.chen@...el.com>, linux-kernel@...r.kernel.org,
mingo@...hat.com, bsegall@...gle.com, vschneid@...hat.com,
juri.lelli@...hat.com, rostedt@...dmis.org, mgorman@...e.de,
dietmar.eggemann@....com, vincent.guittot@...aro.org, peterz@...radead.org,
viresh.kumar@...aro.org
Subject: Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in
select_task_rq_fair()
On 2025/8/22 02:13, Josh Don wrote:
> On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@...ux.dev> wrote:
>>
>> +cc Josh and Viresh, I forgot to cc you, sorry!
>
> Thanks, missed this previously :)
>
>>
>> On 2025/8/20 21:53, Christian Loehle wrote:
>>> On 8/19/25 16:32, Chen, Yu C wrote:
>>>> On 8/18/2025 9:24 PM, Christian Loehle wrote:
>>>>> On 8/18/25 13:47, Chengming Zhou wrote:
>>>>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
>>>>>> on the assumption that the wakee task can pick a cpu that is running a
>>>>>> sched_idle task and preempt it, which is faster than picking an idle cpu
>>>>>> and preempting the idle task there.
>>>>>>
>>>>>> This assumption is correct, but it also brings some problems:
>>>>>>
>>>>>> 1. work conservation: Often sched_idle tasks also end up picking a cpu
>>>>>> that is already running a sched_idle task, instead of utilizing a truly
>>>>>> idle cpu, so work conservation is somewhat broken.
>>>>>>
>>>>>> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
>>>>>> sched_idle group is running. Look at the simple example below.
>>>>>>
>>>>>>               root
>>>>>>              /    \
>>>>>>       kubepods    system
>>>>>>       /      \
>>>>>> burstable    besteffort
>>>>>>              (cpu.idle == 1)
>
> Thanks for bringing attention to this scenario, it's a case I've
> worried about but haven't had a good idea for fixing. Ideally we
> could use find_matching_se(), but we want to do these checks locklessly
> and quickly, so that's out of the question. Agree on it being a hard
> problem.
Yeah, right, we don't want to use find_matching_se() here.
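
For reference, sched_idle_cpu() today only looks at per-rq counts, roughly
like below (exact field names have shifted across kernel versions), so it
cannot tell that e.g. /kubepods/besteffort tasks are only "idle" relative
to the /kubepods subtree:

	static int sched_idle_rq(struct rq *rq)
	{
		/* every runnable task on this rq is sched_idle (by policy
		 * or by sitting under a cpu.idle cfs group) */
		return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
				rq->nr_running);
	}

	static int sched_idle_cpu(int cpu)
	{
		return sched_idle_rq(cpu_rq(cpu));
	}
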
>
> One idea is that we at least handle the (what I think is fairly
> typical) scenario of a root-level sched_idle group well (a root level
You mean the /kubepods and /system groups in this case, right? Neither of
them is sched_idle here.
> sched_idle group is trivially idle with respect to anything else in
> the system that is not also nested under a root-level sched_idle
> group). It would be fairly easy to track a nr_idle_queued cfs_rq
> field, as well as cache on task enqueue whether it nests under a
> sched_idle group.
Ok, we can track whether a task nests under a sched_idle group. But tasks
from /system and /kubepods/burstable are both not under any sched_idle
group, so there seems to be no way to distinguish them except by using
find_matching_se().
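
For the root-level case you describe, a minimal sketch of the enqueue-time
caching could look like below (the nests_under_idle flag and helper are
hypothetical, just for illustration):

	/* Hypothetical: walk the group hierarchy once at enqueue and cache
	 * whether any ancestor group is marked cpu.idle, so the wakeup-time
	 * check stays lockless and O(1).
	 */
	static bool se_nests_under_idle(struct sched_entity *se)
	{
		for_each_sched_entity(se) {
			if (se->my_q && cfs_rq_is_idle(se->my_q))
				return true;
		}
		return false;
	}

	/* in enqueue_task_fair(): */
	p->se.nests_under_idle = se_nests_under_idle(&p->se);

But both /system and /kubepods/burstable tasks would end up with the flag
false, which is exactly the ambiguity above.
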
Thanks!
>
> Best,
> Josh