Message-ID: <31574ba7-a04f-49fa-aea8-a0a751915ecc@bytedance.com>
Date: Wed, 19 Mar 2025 18:36:20 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: K Prateek Nayak <kprateek.nayak@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Josh Don <joshdon@...gle.com>, Tianchen Ding <dtcccc@...ux.alibaba.com>,
Viresh Kumar <viresh.kumar@...aro.org>,
"open list:SCHEDULER" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 1/2] Revert "sched/fair: Make sched-idle CPU
selection consistent throughout"
On 3/19/25 5:17 PM, Vincent Guittot wrote:
> On Thu, 13 Mar 2025 at 08:18, Abel Wu <wuyun.abel@...edance.com> wrote:
>>
>> Hi Vincent,
>>
>> On 3/12/25 12:58 AM, Vincent Guittot wrote:
>>> On Mon, 10 Mar 2025 at 08:41, Abel Wu <wuyun.abel@...edance.com> wrote:
>>>>
>>>> This reverts commit 17346452b25b98acfb395d2a82ec2e4ad0cb7a01.
>>>>
>>>> The above commit tried to unify selection policy between idle cpus
>>>> and SCHED_IDLE ones in fast- and slow-path of select_task_rq_fair()
>>>> by treating them equally (although the SCHED_IDLE cpus turned out
>>>> to be given more preference in the slowpath). The test results seemed
>>>> solid, but the setup didn't take cgroup hierarchy into account,
>>>> which actually made some of our important services get affected.
>>>>
>>>> The cgroup hierarchy in our production environment looks like below,
>>>> which might be common in modern containerized setup:
>>>>
>>>>                  root
>>>>                 /    \
>>>>         kubepods      system.slice
>>>>          /    \\            \
>>>>   guaranteed   besteffort   containerd
>>>>
>>>> (where a double slash '\\' means the child is a SCHED_IDLE cgroup)
>>>>
>>>> A cpu is treated as SCHED_IDLE if only besteffort is running on it,
>>>> and such a cpu is given at least equal preference to the fully idle
>>>> cpus when deciding where to run a newly woken task. But a SCHED_IDLE
>>>> cpu cannot necessarily be preempted soon enough to start serving the
>>>> wakee, and
>>>
>>> Could you give us more details on why a SCHED_IDLE cpu which runs only
>>> besteffort can't be preempted soon enough ?
>>>
>>> because kubepods vs system.slice is not sched_idle when comparing the
>>
>> Yes, exactly. What's worse is that kubepods.weight is the sum of all its
>> children's weights and can easily reach ~3000, while system.slice's
>> weight defaults to 100. The weight configuration can be optimized, but
>> that's another topic.
>>
>>> entities ? So maybe the definition of sched_idle_cpu() should be fixed
>>> instead.
>>
>> Yes, there are at least two ways to fix it:
>>
>> 1) Let sched_idle_cpu() depend on a specific task, just like Josh
>> mentioned in his reply to the 2nd patch. So if sched_idle_cpu(cpu, p)
>> returns true, then
>>
>> a) this rq only contains hierarchical idle tasks, and
>> b) p can preempt current immediately
>>
>> Please see my reply to Josh to check the details.
>
> yeah, that should be the solution which covers all cases, but at the
> cost of walking the cgroup hierarchy, which is far from ideal
Only comparing curr vs wakee doesn't solve the problem. A cpu can be
treated as SCHED_IDLE iff *all* its SCHED_IDLE entities can be preempted
by the wakee.
>
> Could we change h_nr_idle to only track fully sched-idle tasks, I mean
> tasks with a fully sched_idle cgroup hierarchy ? Then we would be sure
> to preempt on those sched-idle cpus.
>
> For the other cases (a sched-idle task, or a sched-idle group child of
> a non-sched-idle group) we would not consider those cpus as sched-idle
> cpus.
Although this is correct, I think it would be too much, since this kind
of setup is rare to the best of my knowledge.
>
>
>>
>> 2) Or get rid of sched_idle_cpu() entirely. This needs some careful
>> rework. The users of cfs_rq::h_nr_idle currently fall into two parts:
>>
>> a) select_task_rq, including sched_balance_find_dst_group_cpu()
>> and select_idle_*(). The former is handled by this series by
>> simply ignoring sched_idle cpus, which should be safe since
>> sched_idle cpus do not always serve the slowpath's goal of
>> finding a least loaded cpu to help load balancing. The
>> latter is roughly designed as follows:
>>
>> - Must search within target LLC domain, since L3$ miss is
>> much more costly than L1/L2$
>> - To use cache more wisely, start searching from the L1/L2$
>> cache hot cpus like prev/recent_used/..
>> - Cheer if we find an idle cpu that can preempt immediately.
>> This helps maximize cpu bandwidth utilization, hence improving
>> total throughput
>> - (?)
>> - Return target anyway, at least it might be cache hot
>>
>> It could be less optimal if we simply remove sched_idle_cpu() and
>> return @target when no idle cpu is available, because @target can
>> be heavily loaded and the cache may no longer be hot by the time
>> the wakee finally hits the cpu. So in order not to fight with load
>> balancing, shall we tiebreak on cpu_load() for the non-idle
>> cpus?
What do you think about choosing a less loaded cpu when no idle ones are
available? The wakee will probably get served sooner, and it also helps
load balancing.
>>
>> b) load balance: sched_balance_domains() and dequeue_entities().
>> We could leave this as-is, but I would propose using h_idle
>> instead: if the on_cpu task is hierarchically idle when
>> triggering normal load balancing, then we guess it's a less
>> loaded cpu and can reduce the balance interval. The rationale
>> behind this is that the idle entities usually get very limited
>> bandwidth when any hierarchically non-idle tasks are available.
>> The heuristics may have false positives, which can generally
>> be divided into 3 cases:
>>
>> (The numbers represent hierarchical shares in %)
>>
>>      A           B             C
>>     / \\         / \\          / \
>>   99    1      50    50      80   20
>>                              //  \
>>                            100    (X)
>
>
> How can the sched_idle group in B have the same share/weight as the
> non-sched-idle one in case B ?
It can't, but theoretically several SCHED_IDLE siblings can sum up to
match the weight of a niced SCHED_NORMAL entity.
                      B
                      |
      ---------------------------------------
      |      ||     ||     ||     ||     ||
     15      3      3      3      3      3
>
>
>>
>> - Case A) The hierarchical share of h_idle tasks is indeed
>> small. So in this case they just get scheduled to take a
>> breather, and the possibility of a false positive is low
>> enough to be safely ignored.
>>
>> - Case B) The h_idle & !h_idle tasks equally share bandwidth
>> which usually means the !h_idle part becomes less loaded
>> and pulling some load might be preferred.
>>
>> - Case C) The hierarchical share of h_idle tasks dominates
>> which usually means their !h_idle parents are allowed to
>> use a big portion of bandwidth. In this case speeding up the
>> balance is still fine because we could pull some !h_idle
>> tasks of the most 'important' cgroup.
>>
>> So the heuristics of using rq->curr's h_idle to judge the need
>> of pulling (load balancing) seems fine.
>>
>> And as a result cfs_rq::h_nr_idle can be removed, and its maintenance
>> cost in the hotpath can be saved.
>>
>> Which way do you prefer? It would be much appreciated if you could
>> shed some light on this.
>>
>>>
>>> a sched_idle_cpu should be preempted immediately otherwise it's not a
>>> sched idle cpu and the definition is meaningless
>>
>> Agree.
>>
>> Thanks!
>> Abel
>>