Date:	Fri, 20 May 2016 10:17:40 +0200
From:	Vincent Guittot <vincent.guittot@...aro.org>
To:	Morten Rasmussen <morten.rasmussen@....com>
Cc:	Yuyang Du <yuyang.du@...el.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Mike Galbraith <efault@....de>, Ingo Molnar <mingo@...nel.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Wanpeng Li <wanpeng.li@...mail.com>
Subject: Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity

Hi Morten,

On 19 May 2016 at 17:36, Morten Rasmussen <morten.rasmussen@....com> wrote:
> On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
>> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@...el.com> wrote:
>> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> >> Commit-ID:  cfa10334318d8212d007da8c771187643c9cef35
>> >> Gitweb:     http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> >> Author:     Morten Rasmussen <morten.rasmussen@....com>
>> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> >> Committer:  Ingo Molnar <mingo@...nel.org>
>> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>> >>
>> >> sched/fair: Correct unit of load_above_capacity
>> >>
>> >> In calculate_imbalance() load_above_capacity currently has the unit
>> >> [capacity] while it is used as being [load/capacity]. Not only is it
>> >> wrong, it also makes it unlikely that load_above_capacity is ever used,
>> >> as the subsequent code picks the smaller of load_above_capacity and
>> >> the avg_load.
>> >>
>> >> This patch ensures that load_above_capacity has the right unit
>> >> [load/capacity].
>>
>> load_above_capacity has the unit [load] and is then compared with other load
>
> I'm not sure if we're talking about the same code; I'm referring to:
>
> max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
>
> a bit further down. Here the first option has unit [load/capacity], and
> the subsequent code multiplies the result with [capacity] to get back to
> [load]:

My understanding is that it multiplies and divides by capacity,
so we select the minimum between
 max_pull * busiest->group_capacity / SCHED_CAPACITY_SCALE
and
 (sds->avg_load - local->avg_load) * local->group_capacity /
SCHED_CAPACITY_SCALE

So the unit of the imbalance is the same as that of max_pull, because
we multiply and divide by [capacity]. The imbalance's unit is [load],
so max_pull also has unit [load], and max_pull has the same unit as
load_above_capacity.

>
> /* How much load to actually move to equalise the imbalance */
> env->imbalance = min(
>         max_pull * busiest->group_capacity,
>         (sds->avg_load - local->avg_load) * local->group_capacity
> ) / SCHED_CAPACITY_SCALE;
>
> That led me to the conclusion that load_above_capacity would have to be
> 'per capacity', i.e. [load/capacity], as well for the code to make
> sense.
>
>>
>> >>
>> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@....com>
>> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>> >> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
>> >> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
>> >> Cc: Mike Galbraith <efault@....de>
>> >> Cc: Peter Zijlstra <peterz@...radead.org>
>> >> Cc: Thomas Gleixner <tglx@...utronix.de>
>> >> Cc: linux-kernel@...r.kernel.org
>> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
>> >> Signed-off-by: Ingo Molnar <mingo@...nel.org>
>> >> ---
>> >>  kernel/sched/fair.c | 6 ++++--
>> >>  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 2338105..218f8e8 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >>       if (busiest->group_type == group_overloaded &&
>> >>           local->group_type   == group_overloaded) {
>> >>               load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> >> -             if (load_above_capacity > busiest->group_capacity)
>> >> +             if (load_above_capacity > busiest->group_capacity) {
>> >>                       load_above_capacity -= busiest->group_capacity;
>> >> -             else
>> >> +                     load_above_capacity *= NICE_0_LOAD;
>> >> +                     load_above_capacity /= busiest->group_capacity;
>>
>> If I follow the sequence correctly,
>> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
>> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
>> so
>> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
>> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>>
>> so the unit is [load]
>
> I'm with you on the equation, but not on the unit, and I find it hard to
> convince myself what the unit really is :-(

The sole NICE_0_LOAD in the equation above tends to confirm that the
unit is just [load].

>
> I think it is down to the fact that we are comparing [load] directly
> with [capacity] here, basically saying [load] == [capacity]. Most other
> places we scale [load] by [capacity], we don't compare the two or
> otherwise imply that they are the same unit. Do we have any other
> examples of this?

IIUC the goal of this part of calculate_imbalance(): we make the
assumption that one task with NICE_0_LOAD should fit on one core with
SCHED_CAPACITY_SCALE capacity, and we want to ensure that we will not
leave a CPU idle.

>
> The name load_above_capacity suggests that it should have unit [load],
> but we actually compute it by subtracting two values, where the latter
> clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
> sched/fair: Clean up scale confusion) changes the former to be
> [capacity].

I made a comment on this.
The original equation was
load_above_capacity -= busiest->group_capacity
which comes from the optimization of
load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE /
SCHED_CAPACITY_SCALE

The assumption behind the optimization was that SCHED_LOAD_SCALE ==
SCHED_CAPACITY_SCALE and SCHED_LOAD_SCALE == NICE_0_LOAD, but that is
not always true if we enable the extra precision on 64-bit systems.

>
> load_above_capacity is later multiplied by [capacity] to determine the
> imbalance which must have unit [load]. So working our way backwards,

It's multiplied by xxx->group_capacity and divided by SCHED_CAPACITY_SCALE.

> load_above_capacity must have unit [load/capacity]. However, if [load] =
> [capacity] then what we really have is just group-normalized [load].
>
> As said in my other reply, the code only really makes sense for under
> special circumstances where [load] == [capacity]. The easy, and possibly
> best, way out of this mess would be to replace the code with something
> like PeterZ's suggestion that I tried to implement in the patch included
> in my other reply.

I'm fine with replacing this part with something simpler.

>
>> Let's take the example of a group made of 2 cores with 2 threads per
>> core, where the capacity of each core is 1.5*SCHED_CAPACITY_SCALE, so
>> the capacity of the group is 3*SCHED_CAPACITY_SCALE.
>> If we have 6 tasks on this group:
>> load_above_capacity = 1*NICE_0_LOAD, which means we want to remove no
>> more than 1 task from the group and leave 5 tasks in the group, whereas
>> we don't expect to have more than 4 tasks since we have 4 CPUs, and
>> probably fewer because the compute capacity of each CPU is less than
>> the default one.
>>
>> So I would have expected
>> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
>> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
>> load_above_capacity = 3*NICE_0_LOAD, which means we want to remove no
>> more than 3 tasks and leave 3 tasks in the group
>
> And this is exactly what you get with this patch :-) load_above_capacity
> (through max_pull) is multiplied by the group capacity to compute the
> actual amount of [load] to remove:

I forgot about the additional weighting of the load by the group's
capacity/SCHED_CAPACITY_SCALE.

>
> env->imbalance  = load_above_capacity * busiest->group_capacity /
>                         SCHED_CAPACITY_SCALE
>
>                 = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
>                         SCHED_CAPACITY_SCALE
>
>                 = 3*NICE_0_LOAD
>
> I don't think we disagree on how it should work :-) Without the capacity
> scaling in this patch you get:
>
> env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
>                         3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
>
>                 = 9*SCHED_CAPACITY_SCALE
>
> Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
> ensure that the resulting imbalance has the proper unit [load].

Yes, I agree it should be NICE_0_LOAD.

>
> I'm happy to scrap this patch and replace the whole thing with something
> else that makes more sense, but IMHO it is needed if the current mess
> should make any sense at all.
