Date:   Fri, 28 Jul 2017 12:09:42 -0600
From:   Jeffrey Hugo <jhugo@...eaurora.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
        Austin Christ <austinwc@...eaurora.org>,
        Tyler Baicar <tbaicar@...eaurora.org>,
        Timur Tabi <timur@...eaurora.org>
Subject: Re: [PATCH V6] sched/fair: Remove group imbalance from
 calculate_imbalance()

On 7/28/2017 7:34 AM, Dietmar Eggemann wrote:
> On 28/07/17 13:59, Peter Zijlstra wrote:
>> On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
>>>>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>>>>> level sched groups.
> 
> [...]
> 
>>>> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
>>>> Then instantly they balance and I get 2x100% back.
>>>
>>> Yeah, could reproduce on IVB-EP (2x10x2).
>>
>> OK, I have one of those. What should I do, because I didn't actually see
>> anything odd.
> 
> Me neither. Sorry, I was unclear. I meant I could reproduce your
> example, that one of the 50% task moves to the idle cpu on this machine.
> 
> [...]
> 

Dietmar's assessment is essentially correct.  We are seeing a case 
where fix_small_imbalance() could resolve a situation with an idle 
core and an overworked core, but it is being skipped.  Here is a 
detailed step-through of the specific case we're seeing.

Consider the following system.

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]

Not NUMA, not SMT.

Imagine a workload which spawns 5 CPU-intensive workers that we 
taskset to 5 cores, so that each worker gets its own core.  We want 
to see:

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T T T   T

Where * indicates the taskset, and T indicates each thread/process.
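
For concreteness, the setup that produces this looks roughly like the 
following (a minimal sketch, not our actual test program; the real 
workload is more involved):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <unistd.h>

	int main(void)
	{
		cpu_set_t set;
		int i;

		/* Restrict ourselves (and future children) to cores 0-4. */
		CPU_ZERO(&set);
		for (i = 0; i <= 4; i++)
			CPU_SET(i, &set);
		sched_setaffinity(0, sizeof(set), &set);

		/* Spawn 5 CPU-intensive workers; ideally each lands on
		 * its own core. */
		for (i = 0; i < 5; i++)
			if (fork() == 0)
				for (;;)
					;	/* busy loop */

		for (;;)
			pause();
	}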

In many cases, workloads are not ideally assigned at fork(), because 
load statistics haven't had a chance to be generated by the new work. 
This results in the following:

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T   T   T
                T

Core 4 is over-assigned (more workers than it can efficiently handle) 
and core 2 is under-assigned (no work to do).  Not great, but we expect 
the system to load_balance() quickly, and move one T from core 4 to core 2.

We do not see this occur; the scheduler leaves a core idle while there 
is work to do, so it is not doing its job.

What happens -

Core 5 is idle, so it triggers load_balance().  At the MC level, it 
identifies core 4 as busiest and attempts to pull load.  Core 5 fails 
to pull anything because it is not part of the taskset, and the load 
cannot be moved to another core in the group (6 or 7) either, so the 
group_imbalance flag is set.
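
For reference, the flag in question gets set in load_balance() when 
affinity prevents the pull; paraphrased from memory of 
kernel/sched/fair.c in this timeframe (exact context may differ):

	/*
	 * We failed to reach balance because of affinity.
	 */
	if (sd_parent) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
			*group_imbalance = 1;
	}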

Core 2 is idle, so it triggers load_balance().  It won't pull any load 
from its siblings in MC (0, 1, or 3) because doing so does not solve an 
imbalance, it just moves the imbalance around.  load_balance() at MC 
fails, so it moves up to DIE.  Since the group containing core 4 is 
group_imbalance, it is selected as busiest.

We go to calculate_imbalance().  Since each T is CPU intensive, its 
load is high; let's say 200, thus busiest->load_per_task is 200. 
However, we have a lot of idle or lightly loaded cores at the DIE level, 
so the sds average (sds->avg_load) gets pushed down significantly; let's 
say it is 20.  Since busiest->group_imbalance is true, we take the min 
of busiest->load_per_task and the sds average (200 and 20), and make 
that the new busiest->load_per_task.
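
For reference, that branch in calculate_imbalance() looks roughly like 
this (paraphrased; the exact comment wording may differ):

	if (busiest->group_type == group_imbalanced) {
		/*
		 * In the group_imb case we cannot rely on group-wide
		 * averages to ensure cpu-load equilibrium, look at
		 * wider averages.
		 */
		busiest->load_per_task =
			min(busiest->load_per_task, sds->avg_load);
	}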

We calculate env->imbalance, which we'll say is 48.

Now we hit the end check -
	/*
	 * if *imbalance is less than the average load per runnable task
	 * there is no guarantee that any tasks will be moved so we'll have
	 * a think about bumping its value to force at least one task to be
	 * moved
	 */
	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);

This checks whether the calculated imbalance can be solved by moving 
tasks around, by comparing the calculated imbalance (env->imbalance) to 
the calculated average load per task in the busiest group 
(busiest->load_per_task).  Since busiest->load_per_task is an average, 
either there are tasks with load less than that value, or all tasks are 
at that value; thus if env->imbalance is at least 
busiest->load_per_task, we have candidate tasks we can evaluate to 
address the imbalance.

env->imbalance is 48, and the modified busiest->load_per_task is 20, so 
the check fails, and we exit out of calculate_imbalance() with an 
imbalance to solve of 48.  We jump all the way up to load_balance(), and 
then into detach_tasks().  We find two tasks that are migration 
candidates because core 2 is in the same taskset as core 4, so we 
evaluate if they can help solve the imbalance, or would just move the 
imbalance around -

		if ((load / 2) > env->imbalance)
			goto next;

The load of each task is 200, so 200 / 2 = 100, and 100 > 48.  We can't 
move anything: doing so would not help solve the imbalance, only move it 
around, so why take the hit of actually migrating a process?

We might not resolve the imbalance, but we get better CPU usage if we 
move a task from core 4 to core 2, which is what we expect. Based on a 
comment within the function fix_small_imbalance(), this appears to be 
the correct path for this circumstance.

In fix_small_imbalance() -

	/*
	 * OK, we don't have enough imbalance to justify moving tasks,
	 * however we may be able to increase total CPU capacity used by
	 * moving them.
	 */

As mentioned above, we aren't hitting this function path for the 
following reason:
	if (env->imbalance < busiest->load_per_task)
		return fix_small_imbalance(env, sds);

env->imbalance is 48 and the modified busiest->load_per_task is 20, so 
48 < 20 is false and fix_small_imbalance() is skipped.  Had 
busiest->load_per_task kept its original value of 200, 48 < 200 would 
be true and we would hit fix_small_imbalance().  It is the 
group_imbalance path that modifies busiest->load_per_task, so what 
happens when we remove that path (i.e. this change)?  We see the 
behavior we expect.

This -

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T   T   T
                T

Goes to this -

DIE [                  (all cores)                  ]

MC  [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
      * * * *   *
      T T T T   T



So, we've identified an issue, and root caused it to a small code 
snippet.  Great for us, but the scheduler is generic code, it needs to 
work for all.  Can this code really be removed?

Here is our current thinking on all uses of this code path. Let us know 
if we missed anything.

SMT avoids this issue because of SD_PREFER_SIBLING: group overloaded is 
treated much like group imbalanced, but it does not set 
busiest->group_imbalance, so busiest->load_per_task never gets modified.

In cases where the path is actually used:

1- busiest->group_imbalance is not set.  The code path is effectively a 
NOP, therefore it has no positive or negative effect.

2- busiest->group_imbalance is set, and sds->avg_load is greater than 
busiest->load_per_task.  busiest->load_per_task gets re-assigned its 
own value, so nothing changes.  The code path is effectively a NOP, 
therefore it has no positive or negative effect.

3- busiest->group_imbalance is set, and sds->avg_load is less than 
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is more 
than the original busiest->load_per_task, it is also more than 
sds->avg_load, so we do not hit fix_small_imbalance(), as expected.  The 
code path is effectively a NOP, therefore it has no positive or negative 
effect.

4- busiest->group_imbalance is set, and sds->avg_load is less than 
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is less 
than the original busiest->load_per_task and also less than 
sds->avg_load, we hit fix_small_imbalance(), but we would have hit it 
anyway.  The code path is effectively a NOP, therefore it has no 
positive or negative effect.

5- busiest->group_imbalance is set, and sds->avg_load is less than 
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is less 
than the original busiest->load_per_task but more than sds->avg_load, 
we do not hit fix_small_imbalance().  If we happen to have tasks we can 
move based on their load values compared to the current env->imbalance, 
we can still make migrations.  The code path is effectively a NOP, 
therefore it has no positive or negative effect.

6- busiest->group_imbalance is set, and sds->avg_load is less than 
busiest->load_per_task.  busiest->load_per_task gets assigned the value 
of sds->avg_load.  If the calculated imbalance (env->imbalance) is less 
than the original busiest->load_per_task but more than sds->avg_load, 
we do not hit fix_small_imbalance().  If we do not have tasks which can 
migrate based on their load values compared to the current 
env->imbalance, we hit the issue described above.  The code path has 
affected the behavior of the scheduler in a way which is not expected, 
therefore it has a negative effect.

So, in all of these scenarios but one, the group_imbalance code path of 
calculate_imbalance() does nothing.  In the remaining one scenario, the 
code path prevents the scheduler from doing its job (assigning work to 
idle cores), so it harms the system.

Thus we conclude that the code has no benefit, and should be removed.

Considered from a different perspective, busiest->load_per_task, in the 
current algorithm, represents actual load that can be moved around. 
Setting it to a global average breaks this invariant and causes a faulty 
decision at the fix_small_imbalance() path.
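
To make the case analysis above concrete, here is a schematic userspace 
model of the decision.  This is a hypothetical illustration only, not 
kernel code:

	#include <stdbool.h>

	/*
	 * Models whether calculate_imbalance() would branch into
	 * fix_small_imbalance(), with and without the group_imbalance
	 * min() path.  Hypothetical helper for illustration only.
	 */
	static bool calls_fix_small_imbalance(unsigned long imbalance,
					      unsigned long load_per_task,
					      unsigned long avg_load,
					      bool group_imbalanced,
					      bool with_min_path)
	{
		if (with_min_path && group_imbalanced &&
		    avg_load < load_per_task)
			load_per_task = avg_load; /* the min() in question */

		return imbalance < load_per_task;
	}

Plugging in the numbers from the example (imbalance 48, load_per_task 
200, avg_load 20): with the min() path the model returns false (48 < 20 
fails), without it true (48 < 200 holds).  The two variants disagree 
exactly when avg_load < imbalance < load_per_task, which is cases 5 and 
6 above; in every other case they agree.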


In summary, our system is harmed by this code branch, and our analysis 
indicates the branch has no benefit to the scheduler as it sits today, 
so we feel justified in removing it.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
