Message-ID: <31574ba7-a04f-49fa-aea8-a0a751915ecc@bytedance.com>
Date: Wed, 19 Mar 2025 18:36:20 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: K Prateek Nayak <kprateek.nayak@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Josh Don <joshdon@...gle.com>, Tianchen Ding <dtcccc@...ux.alibaba.com>,
Viresh Kumar <viresh.kumar@...aro.org>,
"open list:SCHEDULER" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 1/2] Revert "sched/fair: Make sched-idle CPU
selection consistent throughout"
On 3/19/25 5:17 PM, Vincent Guittot wrote:
> On Thu, 13 Mar 2025 at 08:18, Abel Wu <wuyun.abel@...edance.com> wrote:
>>
>> Hi Vincent,
>>
>> On 3/12/25 12:58 AM, Vincent Guittot wrote:
>>> On Mon, 10 Mar 2025 at 08:41, Abel Wu <wuyun.abel@...edance.com> wrote:
>>>>
>>>> This reverts commit 17346452b25b98acfb395d2a82ec2e4ad0cb7a01.
>>>>
>>>> The above commit tried to unify selection policy between idle cpus
>>>> and SCHED_IDLE ones in fast- and slow-path of select_task_rq_fair()
>>>> by treating them equally (although the SCHED_IDLE cpus turned out
>>>> to be given more preference in the slowpath). The test results seemed
>>>> solid, but the setup didn't take cgroup hierarchy into account,
>>>> which actually made some of our important services get affected.
>>>>
>>>> The cgroup hierarchy in our production environment looks like below,
>>>> which might be common in modern containerized setup:
>>>>
>>>>                  root
>>>>                 /    \
>>>>         kubepods      system.slice
>>>>          /    \\            \
>>>>   guaranteed   besteffort   containerd
>>>>
>>>> (where a double slash '\\' means the child is a SCHED_IDLE cgroup)
>>>>
>>>> A cpu is treated as SCHED_IDLE if only besteffort is running on it,
>>>> and such a cpu is given at least equal preference to the fully idle
>>>> cpus when deciding where to run a newly woken task. But a SCHED_IDLE
>>>> cpu cannot necessarily be preempted soon enough to start serving the
>>>> wakee, and
>>>
>>> Could you give us more details on why a SCHED_IDLE cpu which runs only
>>> besteffort can't be preempted soon enough ?
>>>
>>> because kubepods vs system.slice is not sched_idle when comparing the
>>
>> Yes, exactly. What's worse is that kubepods.weight is the sum of all its
>> children's weights and can easily reach ~3000, while system.slice's
>> weight defaults to 100. The weight configuration can be optimized, but
>> that's another topic.
>>
>>> entities ? So maybe the definition of sched_idle_cpu() should be fixed
>>> instead.
>>
>> Yes, there are at least two ways to fix it:
>>
>> 1) Let sched_idle_cpu() depend on a specific task, just like Josh
>> mentioned in his reply to the 2nd patch. So if sched_idle_cpu(cpu, p)
>> returns true, then
>>
>> a) this rq only contains hierarchical idle tasks, and
>> b) p can preempt current immediately
>>
>> Please see my reply to Josh to check the details.
>
> yeah, that should be the solution which covers all cases, but at the
> cost of walking the cgroup hierarchy, which is far from ideal
Only comparing curr vs wakee doesn't solve the problem. A cpu can be
treated as SCHED_IDLE iff *all* its SCHED_IDLE entities can be preempted
by the wakee.
>
> Could we change h_nr_idle to only track fully sched-idle tasks, I mean
> tasks with a fully sched_idle cgroup hierarchy ? Then we would be sure
> to preempt on those sched-idle cpus.
>
> For the other cases (a sched-idle task, or a sched-idle group child of
> a non-sched-idle group) we would not consider those cpus as sched-idle
> cpus.
Although this is correct, I think it would be too much, since this kind
of setup is rare to the best of my knowledge.
>
>
>>
>> 2) Or get rid of sched_idle_cpu() entirely. This needs some careful
>> rework. The users of cfs_rq::h_nr_idle currently fall into two parts:
>>
>> a) select_task_rq, including sched_balance_find_dst_group_cpu()
>> and select_idle_*(). The former is handled by this series by
>> simply ignoring sched_idle cpus, which should be safe since
>> sched_idle cpus do not always serve the slowpath's goal of
>> finding a least loaded cpu to help load balancing. The
>> latter is roughly designed as follows:
>>
>> - Must search within target LLC domain, since L3$ miss is
>> much more costly than L1/L2$
>> - To use cache more wisely, start searching from the L1/L2$
>> cache hot cpus like prev/recent_used/..
>> - Cheer if we find an idle cpu that can preempt immediately.
>> This helps maximize cpu bandwidth utilization, hence improving
>> total throughput
>> - (?)
>> - Return target anyway, at least it might be cache hot
>>
>> It could be less optimal if we simply remove sched_idle_cpu() and
>> return @target when no idle cpu is available, because @target can
>> be heavily loaded and the cache may no longer be hot by the time
>> the wakee finally hits the cpu. So in order not to fight with load
>> balancing, shall we tiebreak on cpu_load() for the non-idle
>> cpus?
What do you think about choosing a less loaded cpu when no idle ones are
available? The wakee will probably get served sooner, and it also helps
load balancing.
>>
>> b) load balance: sched_balance_domains() and dequeue_entities().
>> We could leave this as-is, but I would propose using h_idle
>> instead: if the on_cpu task is hierarchically idle when
>> triggering normal load balancing, then we guess it's a less
>> loaded cpu and can reduce the balance interval. The rationale
>> behind this is that the idle entities usually get very limited
>> bandwidth when any hierarchically non-idle tasks are available.
>> The heuristics may have false positives, which can generally
>> be divided into 3 cases:
>>
>> (The numbers represent hierarchical shares in %)
>>
>>      A           B             C
>>     / \\         / \\          / \
>>   99    1      50    50      80   20
>>                              //  \
>>                            100    (X)
>
>
> How can the sched_idle group in B have the same share/weight as the
> non-sched-idle one in case B ?
It can't, but theoretically several SCHED_IDLE siblings can sum up to
match the weight of a niced SCHED_NORMAL entity.
                      B
                      |
      ---------------------------------------
      |      ||     ||     ||     ||     ||
     15      3      3      3      3      3
>
>
>>
>> - Case A) The hierarchical share of h_idle tasks is indeed
>> small. So in this case they just get scheduled to take a
>> breather, and the possibility of a false positive is low
>> enough to be safely ignored.
>>
>> - Case B) The h_idle & !h_idle tasks equally share bandwidth
>> which usually means the !h_idle part becomes less loaded
>> and pulling some load might be preferred.
>>
>> - Case C) The hierarchical share of h_idle tasks dominates
>> which usually means their !h_idle parents are allowed to
>> use a big portion of bandwidth. In this case speeding up the
>> balance is still fine because we could pull some !h_idle
>> tasks of the most 'important' cgroup.
>>
>> So the heuristics of using rq->curr's h_idle to judge the need
>> of pulling (load balancing) seems fine.
>>
>> And as a result cfs_rq::h_nr_idle can be removed, and its maintenance
>> cost in the hotpath can be saved.
>>
>> Which way do you prefer? It would be much appreciated if you could
>> shed some light on this.
>>
>>>
>>> a sched_idle_cpu should be preempted immediately otherwise it's not a
>>> sched idle cpu and the definition is meaningless
>>
>> Agree.
>>
>> Thanks!
>> Abel
>>