linux-kernel - Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <29bf4c69-b961-0482-7582-d6f1e09e997a@arm.com>
Date:   Thu, 19 Dec 2019 15:16:27 +0000
From:   Valentin Schneider <valentin.schneider@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>,
        Mel Gorman <mgorman@...hsingularity.net>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>, pauld@...hat.com,
        srikar@...ux.vnet.ibm.com, quentin.perret@....com,
        dietmar.eggemann@....com, Morten.Rasmussen@....com,
        hdanton@...a.com, parth@...ux.ibm.com, riel@...riel.com,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance
 between SD_NUMA domains

On 19/12/2019 14:45, Vincent Guittot wrote:
>> @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  			env->migration_type = migrate_task;
>>  			lsub_positive(&nr_diff, local->sum_nr_running);
>>  			env->imbalance = nr_diff >> 1;
>> -			return;
>> +			goto out_spare;
> 
> Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too.
> 

I got confused by that as well but it's not just prefer_sibling actually;
there are cases where we enter the group_has_spare but none of its
nested if blocks, so we fall through to out_spare.

>> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  		env->migration_type = migrate_task;
>>  		env->imbalance = max_t(long, 0, (local->idle_cpus -
>>  						 busiest->idle_cpus) >> 1);
>> +
>> +out_spare:
>> +		/*
>> +		 * Whether balancing the number of running tasks or the number
>> +		 * of idle CPUs, consider allowing some degree of imbalance if
>> +		 * migrating between NUMA domains.
>> +		 */
>> +		if (env->sd->flags & SD_NUMA) {
>> +			unsigned int imbalance_adj, imbalance_max;
>> +
>> +			/*
>> +			 * imbalance_adj is the allowable degree of imbalance
>> +			 * to exist between two NUMA domains. It's calculated
>> +			 * relative to imbalance_pct with a minimum of two
>> +			 * tasks or idle CPUs.
>> +			 */
>> +			imbalance_adj = (busiest->group_weight *
>> +				(env->sd->imbalance_pct - 100) / 100) >> 1;
>> +			imbalance_adj = max(imbalance_adj, 2U);
>> +
>> +			/*
>> +			 * Ignore imbalance unless busiest sd is close to 50%
>> +			 * utilisation. At that point balancing for memory
>> +			 * bandwidth and potentially avoiding unnecessary use
>> +			 * of HT siblings is as relevant as memory locality.
>> +			 */
>> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
>> +			if (env->imbalance <= imbalance_adj &&
>> +			    busiest->sum_nr_running < imbalance_max) {i
> 
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> 

I think it's better to hinge the cutoff on the busiest->sum_nr_running than
on busiest->idle_cpus. If you're balancing between big NUMA groups, you
could end up with a busiest->group_type == group_has_spare despite having
*some* of its CPUs overloaded (but still with
sg->sum_nr_running > sg->group_weight; simply because there's tons of CPUs).

> and you could simplify by 
> 
> 
> 	if ((env->sd->flags & SD_NUMA) &&
> 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> 			env->imbalance = 0;
> 			return;
> 	}
> 
> And otherwise it will continue with the current path
> 
> Also I'm a bit worry about using a 50% threshold that look a bit like a
> heuristic which can change depending of platform and the UCs that run of the
> system.
> 
> In fact i was hoping that we could use the numa_preferred_nid ? During the
> detach of tasks, we don't detach the task if busiest has spare capacity and
> preferred_nid of the task is busiest.
> 
> I'm going to run some tests to see the impact on my platform 
> 
> Regards,
> Vincent
> }
> 
> 
>> +				env->imbalance = 0;
>> +			}
>> +		}
>>  		return;
>>  	}
>>  
>>
>> -- 
>> Mel Gorman
>> SUSE Labs