linux-kernel - Re: [sched] 143e1e28cb4: +17.9% aim7.jobs-per-min, -9.7% hackbench.throughput

Message-ID: <20140811133352.GC9918@twins.programming.kicks-ass.net>
Date:	Mon, 11 Aug 2014 15:33:52 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Fengguang Wu <fengguang.wu@...el.com>
Cc:	Vincent Guittot <vincent.guittot@...aro.org>,
	Dave Hansen <dave.hansen@...el.com>,
	LKML <linux-kernel@...r.kernel.org>, lkp@...org,
	Ingo Molnar <mingo@...nel.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
Subject: Re: [sched] 143e1e28cb4: +17.9% aim7.jobs-per-min, -9.7%
 hackbench.throughput

On Sun, Aug 10, 2014 at 06:54:13PM +0800, Fengguang Wu wrote:
> This view may be easier to read, by grouping the metrics by test case.
> 
> test case: brickland1/aim7/6000-page_test

OK, I have a similar system to the brickland thing (slightly different
configuration, but should be close enough).

Now; do you have a description of each test-case someplace? In
particular, it might be good to have a small annotation to show which
direction is better.

> 
>     128529 ± 1%     +17.9%     151594 ± 0%  TOTAL aim7.jobs-per-min

jobs per minute, + is better, so no worries there.

>     582269 ±14%     -55.6%     258617 ±16%  TOTAL softirqs.SCHED
>     993654 ± 2%     -19.9%     795962 ± 3%  TOTAL softirqs.RCU
>   15865125 ± 1%     -15.0%   13485882 ± 1%  TOTAL softirqs.TIMER

>   59366697 ± 3%     -46.1%   32017187 ± 7%  TOTAL cpuidle.C1-IVT.time
>      54543 ±11%     -37.2%      34252 ±16%  TOTAL cpuidle.C1-IVT.usage
>      19542 ± 9%     -38.3%      12057 ± 4%  TOTAL cpuidle.C1E-IVT.usage
>   49527464 ± 6%     -32.4%   33488833 ± 4%  TOTAL cpuidle.C1E-IVT.time
>      76064 ± 3%     -32.2%      51572 ± 6%  TOTAL cpuidle.C6-IVT.usage

Less idle time; might be good if the work is CPU-bound, might be bad if
not; hard to say.

>       2.82 ± 3%     +21.9%       3.43 ± 4%  TOTAL turbostat.%pc2
>       4.40 ± 2%     +22.0%       5.37 ± 4%  TOTAL turbostat.%c6
>      15.75 ± 1%      -3.4%      15.21 ± 0%  TOTAL turbostat.RAM_W

>    3150464 ± 2%     -24.2%    2387551 ± 3%  TOTAL time.voluntary_context_switches

Typically fewer context switches is better.

>        281 ± 1%     -15.1%        238 ± 0%  TOTAL time.elapsed_time
>      29294 ± 1%     -14.3%      25093 ± 0%  TOTAL time.system_time

Less time spent (on presumably the same work) is better.

>    4529818 ± 1%      -8.8%    4129398 ± 1%  TOTAL time.involuntary_context_switches

Fewer preemptions, also generally better.

>      10655 ± 0%      +1.4%      10802 ± 0%  TOTAL time.percent_of_cpu_this_job_got

Seems an improvement; not sure.

Many more stats, but from the above it looks like it's an overall 'win';
or am I reading the thing wrong?


Now I think I see why this is; we've reduced load balancing frequency
significantly on this machine due to:


-#define SD_SIBLING_INIT (struct sched_domain) {                                \
-       .min_interval           = 1,                                    \
-       .max_interval           = 2,                                    \


-#define SD_MC_INIT (struct sched_domain) {                             \
-       .min_interval           = 1,                                    \
-       .max_interval           = 4,                                    \


-#define SD_CPU_INIT (struct sched_domain) {                            \
-       .min_interval           = 1,                                    \
-       .max_interval           = 4,                                    \


        *sd = (struct sched_domain){
                .min_interval           = sd_weight,
                .max_interval           = 2*sd_weight,

This increased both the min and max values significantly for all domains
involved.

That said; I think we might want to do something like the below; I can
imagine decreasing load balancing too much will negatively impact other
workloads.

Maybe slightly modified to make sure the first domain has a min_interval
of 1.

---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1211575a2208..67ed5d854da1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6049,8 +6049,8 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
-		.min_interval		= sd_weight,
-		.max_interval		= 2*sd_weight,
+		.min_interval		= max(1, sd_weight/2),
+		.max_interval		= sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
 
@@ -6076,7 +6076,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 					,
 
 		.last_balance		= jiffies,
-		.balance_interval	= sd_weight,
+		.balance_interval	= max(1, sd_weight/2),
 		.smt_gain		= 0,
 		.max_newidle_lb_cost	= 0,
 		.next_decay_max_lb_cost	= jiffies,
