linux-kernel - Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

Message-ID: <82167f6a-2d97-d3ab-35eb-4d4fa0c62bd9@arm.com>
Date:   Thu, 29 Sep 2016 17:15:17 +0100
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Yuyang Du <yuyang.du@...el.com>
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

On 28/09/16 14:13, Vincent Guittot wrote:
> On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
>> On 28 September 2016 at 04:31, Dietmar Eggemann
>> <dietmar.eggemann@....com> wrote:
>>> On 28/09/16 12:19, Peter Zijlstra wrote:
>>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
>>>>> On 28/09/16 11:14, Peter Zijlstra wrote:
>>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:

[...]

> IIUC the problem raised by Matt, he see a regression because we now remove
> during the dequeue the exact same load as during the enqueue so
> cfs_rq->runnable_load_avg is null so we select a cfs_rq that might already have
> a lot of hackbench blocked thread.

This is my understanding as well.

> The fact that runnable_load_avg is null, when the cfs_rq doesn't have runnable
> task, is quite correct and we should keep it. But when we look for the idlest
> group, we have to take into account the blocked thread.
> 
> That's what i have tried to do below

[...]

> +		/*
> +		 * In case that we have same runnable load (especially null
> +		 *  runnable load), we select the group with smallest blocked
> +		 *  load
> +		 */
> +			min_avg_load = avg_load;
> +			min_runnable_load = runnable_load;

Setting 'min_runnable_load' wouldn't be necessary here.

>  			idlest = group;
>  		}
> +
>  	} while (group = group->next, group != sd->groups);
>  
> -	if (!idlest || 100*this_load < imbalance*min_load)
> +	if (!idlest || 100*this_load < imbalance*min_runnable_load)
>  		return NULL;
>  	return idlest;
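
To make the selection logic above easier to follow, here is a minimal
userspace sketch of it. It is only a model under stated assumptions: the
'else if' condition is elided from the quote, so the tie-break test is
inferred from the comment, and 'struct group_stat' and 'pick_idlest()'
are hypothetical names, not kernel code. The local group is reduced to a
single 'this_load' argument.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for per-group statistics; not a kernel type. */
struct group_stat {
	unsigned long runnable_load;	/* load of runnable tasks only */
	unsigned long avg_load;		/* runnable plus blocked load */
};

/*
 * Return the index of the idlest remote group, or -1 if we should stay
 * in the local group (the model's equivalent of returning NULL).
 */
static int pick_idlest(const struct group_stat *g, size_t n,
		       unsigned long this_load, unsigned long imbalance)
{
	unsigned long min_runnable_load = ULONG_MAX;
	unsigned long min_avg_load = ULONG_MAX;
	int idlest = -1;
	size_t i;

	assert(g != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (g[i].runnable_load < min_runnable_load) {
			min_runnable_load = g[i].runnable_load;
			min_avg_load = g[i].avg_load;
			idlest = (int)i;
		} else if (g[i].runnable_load == min_runnable_load &&
			   g[i].avg_load < min_avg_load) {
			/*
			 * Assumed tie-break: same runnable load, smaller
			 * blocked load. runnable_load already equals
			 * min_runnable_load, so only min_avg_load changes.
			 */
			min_avg_load = g[i].avg_load;
			idlest = (int)i;
		}
	}

	if (idlest < 0 || 100 * this_load < imbalance * min_runnable_load)
		return -1;
	return idlest;
}
```

In this model, two groups with zero runnable load are told apart only by
their blocked load, and the tie-break branch never changes
min_runnable_load.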

On the Hikey board (ARM64; 2 clusters with 4 cpus each, so MC and DIE),
the first f_i_g (on DIE) is still based on rbl_load. So if the first
hackbench task (the one spawning all the worker tasks) runs on cluster1
and the previous worker p_X has already blocked, f_i_g returns cluster2.
If p_X is still running, it returns idlest=NULL and we continue with
cluster1 for the second f_i_g on MC.

The additional 'else if' condition doesn't seem to help much. There are
occurrences where an idle cpu (which never took a worker) still has a
small rbl_load value (this shouldn't actually happen, weighted_cpuload()
should be 0), so it is never chosen. The condition can even have a
negative impact: an idle cpu (which never took a worker) is not chosen
because its load (cfs->avg.load_avg) hasn't been updated for a long
time, so another cpu with rbl_load = 0 and a smaller load is used
instead, even though a lot of workers were already placed on it.

There are also episodes where we 'pack' workers onto the cpu that was
initially picked in f_i_c (on DIE), because (100*this_load <
imbalance*min_load) is true in f_i_g on MC. Maybe we can get rid of this
check for !sd->child?
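
As a rough illustration of that idea, here is a standalone userspace
model, not a tested kernel change. The helper 'stay_local()' and its
parameters are made up for this sketch; 'has_child' stands in for
sd->child being non-NULL, and skipping the bias at the lowest level is
just one possible reading of the suggestion.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return true if f_i_g should return NULL, i.e.
 * keep the task in the local group.
 */
static bool stay_local(bool has_idlest, bool has_child,
		       unsigned long this_load, unsigned long min_load,
		       unsigned long imbalance)
{
	/* imbalance is a percentage biased towards local, so >= 100. */
	assert(imbalance >= 100);

	if (!has_idlest)
		return true;

	/*
	 * Assumption: at the lowest level (no child domain) the local
	 * bias is dropped, so an idlest group is always used.
	 */
	if (!has_child)
		return false;

	return 100 * this_load < imbalance * min_load;
}
```

With this, the comparison that causes the packing on MC is only applied
at levels that still have a child domain.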

[...]