Date:	Sun, 14 Feb 2010 00:03:56 +0530
From:	Vaidyanathan Srinivasan <svaidy@...ux.vnet.ibm.com>
To:	Suresh Siddha <suresh.b.siddha@...el.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>,
	"Ma, Ling" <ling.ma@...el.com>,
	"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com>,
	"ego@...ibm.com" <ego@...ibm.com>
Subject: Re: change in sched cpu_power causing regressions with SCHED_MC

* Suresh Siddha <suresh.b.siddha@...el.com> [2010-02-12 17:31:19]:

> Peterz,
> 
> We have one more problem that Yanmin and Ling Ma reported. On
> dual-socket quad-core platforms (for example, platforms based on
> NHM-EP), we are seeing scenarios where one socket is completely busy
> (with all 4 cores running 4 tasks) and the other socket is completely
> idle.
> 
> This causes performance issues as those 4 tasks share the memory
> controller, last-level cache bandwidth, etc. Also, we won't be taking
> advantage of turbo mode as much as we would like. We would get all of
> these benefits if we moved two of those tasks to the other socket;
> then both sockets could potentially go into turbo mode and improve
> performance.
> 
> In short, your recent change (shown below) broke this behavior. At the
> kernel summit you mentioned that you made this change without
> affecting the behavior of SMT/MC, and my testing immediately after the
> kernel summit also didn't show the problem (perhaps my test didn't hit
> this specific change). But apparently we are having performance issues
> with this patch (Ling Ma's bisect pointed to it). I will look into this
> in more detail after the long weekend (to see if we can catch this
> scenario in fix_small_imbalance() etc.), but I wanted to give you a
> quick heads-up.
> Thanks.
> 
> commit f93e65c186ab3c05ce2068733ca10e34fd00125e
> Author: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> Date:   Tue Sep 1 10:34:32 2009 +0200
> 
>     sched: Restore __cpu_power to a straight sum of power
>     
>     cpu_power is supposed to be a representation of the process
>     capacity of the cpu, not a value to randomly tweak in order to
>     affect placement.
>     
>     Remove the placement hacks.
>     
>     Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
>     Tested-by: Andreas Herrmann <andreas.herrmann3@....com>
>     Acked-by: Andreas Herrmann <andreas.herrmann3@....com>
>     Acked-by: Gautham R Shenoy <ego@...ibm.com>
>     Cc: Balbir Singh <balbir@...ibm.com>
>     LKML-Reference: <20090901083825.810860576@...llo.nl>
>     Signed-off-by: Ingo Molnar <mingo@...e.hu>
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index da1edc8..584a122 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8464,15 +8464,13 @@ static void free_sched_groups(const struct cpumask *cpu_map,
>   * there are asymmetries in the topology. If there are asymmetries, group
>   * having more cpu_power will pickup more load compared to the group having
>   * less cpu_power.
> - *
> - * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
> - * the maximum number of tasks a group can handle in the presence of other idle
> - * or lightly loaded groups in the same sched domain.
>   */
>  static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>  {
>  	struct sched_domain *child;
>  	struct sched_group *group;
> +	long power;
> +	int weight;
> 
>  	WARN_ON(!sd || !sd->groups);
> 
> @@ -8483,22 +8481,20 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
> 
>  	sd->groups->__cpu_power = 0;
> 
> -	/*
> -	 * For perf policy, if the groups in child domain share resources
> -	 * (for example cores sharing some portions of the cache hierarchy
> -	 * or SMT), then set this domain groups cpu_power such that each group
> -	 * can handle only one task, when there are other idle groups in the
> -	 * same sched domain.
> -	 */
> -	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> -		       (child->flags &
> -			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
> -		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
> +	if (!child) {
> +		power = SCHED_LOAD_SCALE;
> +		weight = cpumask_weight(sched_domain_span(sd));
> +		/*
> +		 * SMT siblings share the power of a single core.
> +		 */
> +		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
> +			power /= weight;
> +		sg_inc_cpu_power(sd->groups, power);
>  		return;
>  	}
> 
>  	/*
> -	 * add cpu_power of each child group to this groups cpu_power
> +	 * Add cpu_power of each child group to this groups cpu_power.
>  	 */
>  	group = child->groups;
>  	do {
> 

I have hit the same problem on older non-HT quad cores as well.
(http://lkml.org/lkml/2010/2/8/80)
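
To make the effect of the new cpu_power computation concrete, here is a
minimal user-space sketch of the arithmetic (assuming SCHED_LOAD_SCALE
is 1024 as in the current scheduler; socket_power() is only an
illustrative helper, not the kernel implementation):

/*
 * Sketch of how the patched init_sched_groups_power() sizes a
 * socket-level group: each logical cpu contributes SCHED_LOAD_SCALE,
 * SMT siblings split one core's share, and higher-level groups are a
 * straight sum of their children.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL

static unsigned long socket_power(unsigned long cores,
				  unsigned long smt_siblings)
{
	/* lowest level: SCALE split across SMT siblings */
	unsigned long cpu_power = SCHED_LOAD_SCALE / smt_siblings;

	/* socket group: straight sum over all logical cpus */
	return cores * smt_siblings * cpu_power;
}

int main(void)
{
	unsigned long power = socket_power(4, 1); /* non-HT quad core */

	/* roughly how the balancer turns group power into a task capacity */
	unsigned long capacity =
		(power + SCHED_LOAD_SCALE / 2) / SCHED_LOAD_SCALE;

	printf("socket cpu_power=%lu capacity=%lu tasks\n",
	       power, capacity);	/* prints 4096 and 4 */
	return 0;
}

Whether the socket has HT (socket_power(4, 2): 8 siblings at 512 each)
or not (socket_power(4, 1): 4 cores at 1024 each), the group power comes
out to 4096 and the derived capacity to 4 tasks, so a socket that is
already running 4 tasks looks "at capacity" rather than overloaded.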

The following condition in find_busiest_group()

	sds.max_load <= sds.busiest_load_per_task

treats unequally loaded groups as balanced as long as they are below
capacity.

We need to change the above condition before we hit the
fix_small_imbalance() step.
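
To see why that check fires here, a small sketch of the numbers (again
assuming SCHED_LOAD_SCALE = 1024 and four nice-0 tasks on a 4-core
socket; illustrative user-space code, not the kernel path):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL

int main(void)
{
	unsigned long cpu_power  = 4 * SCHED_LOAD_SCALE;	/* busiest socket */
	unsigned long nr_running = 4;				/* nice-0 tasks   */
	unsigned long group_load = nr_running * 1024;		/* weighted load  */

	/* load of the busiest group, normalized by its cpu_power */
	unsigned long max_load = group_load * SCHED_LOAD_SCALE / cpu_power;
	unsigned long busiest_load_per_task = group_load / nr_running;

	printf("max_load=%lu busiest_load_per_task=%lu -> %s\n",
	       max_load, busiest_load_per_task,
	       max_load <= busiest_load_per_task ?
			"treated as balanced" : "imbalance detected");
	return 0;
}

With 1024 <= 1024 the groups are treated as balanced, so we bail out of
find_busiest_group() before fix_small_imbalance() ever gets a chance to
address the 4-tasks-vs-idle split across the two sockets.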
        
--Vaidy
         
