linux-kernel - Re: [PATCH] sched/fair: Skip wake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56095E7C.7080300@odin.com>
Date:	Mon, 28 Sep 2015 18:36:28 +0300
From:	Kirill Tkhai <ktkhai@...n.com>
To:	Mike Galbraith <umgwanakikbuti@...il.com>
CC:	<linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

On 28.09.2015 16:12, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:
> 
>> Looks like, NAK may be better, because it saves L1 cache, while the patch always invalidates it.
> 
> Yeah, bounce hurts more when there's no concurrency win waiting to be
> collected.  This mixed load wasn't a great choice, but it turned out to
> be pretty interesting.  Something waking a gaggle of waiters on a busy
> big socket should do very bad things.
> 
>> Could you say, do you execute pgbench using just -cX -jY -T30 or something special? I've tried it,
>> but the dispersion of the results much differs from time to time.
> 
> pgbench -T $testtime -j 1 -S -c $clients

Using -S the results stabilized. It looks like my db is enormous, and some problem with that. I will
investigate.

Thanks!
 
>>> Ok, that's what I want to see, full repeat.
>>> master = twiddle
>>> master+ = twiddle+patch
>>>
>>> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>>>                                              master                           master+
>>> pgbench                   1       2       3     avg         1       2       3     avg   comp
>>> clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
>>> clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
>>> clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
>>> clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
>>>
>>> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
>>>
>>>                                              master                           master+
>>> tbench                    1       2       3     avg         1       2       3     avg   comp
>>> pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
>>> pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
>>> pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
>>> pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
>>>
>>> tbench without competition
>>>                master        master+   comp
>>> pairs 1        MB/s =   694     692    .997 
>>> pairs 2        MB/s =  1268    1259    .992
>>> pairs 4        MB/s =  2210    2165    .979
>>> pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)
>>
>> Hm, it seems tbench with competition is better only because of a busy system makes tbench
>> processes be woken on the same cpu.
> 
> Yeah.  When box is really full, select_idle_sibling() (obviously) turns
> into a waste of cycles, but even as you approach that, especially when
> filling the box with identical copies of nearly fully synchronous high
> frequency localhost packet blasters, stacking is a win.
> 
> What bent my head up a bit was the combined effect of making wake_wide()
> really keep pgbench from collapsing then adding the affine wakeup grant
> for tbench.  It's not at all clear to me why 2,4 would be so demolished.

Mike, one more moment. wake_wide() and current logic confuses me a bit.
It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
if a function is not for choosing this_cpu's llc domain only. We use it
for searching in prev_cpu llc domain too, and it seems we are not interested
in current flips in this case. Imagine a situation, when we share a mutex
with a task on another NUMA node. When the task is realising the mutex
it is waking us, but we definitelly won't use affine logic in this case.
We wake the wakee anywhere and loose hot cache. I changed the logic, and
tried pgbench 1:8. The results (I threw away 3 first iterations, because
they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).


Before:

  trans. |  tps (i)     | tps (e)
--------------------------------------
12098226 | 60491.067392 | 60500.886373
12030184 | 60150.874285 | 60160.654295
11882977 | 59414.829150 | 59424.830637
12020125 | 60100.579023 | 60111.600176
12161917 | 60809.547906 | 60827.321639
12154660 | 60773.249254 | 60783.085165

After:

 trans.  | tps (i)      | tps (e)
--------------------------------------
12770407 | 63849.883578 | 63860.310019      
12635366 | 63176.399769 | 63187.152569      
12676890 | 63384.396440 | 63400.930755      
12639949 | 63199.526330 | 63210.460753      
12670626 | 63353.079951 | 63363.274143      
12647001 | 63209.613698 | 63219.812331      

I'm going to test other cases, but could you tell me (if you remember) are there reasons
we skip prev_cpu, like I described above? Some types of workloads etc.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..dfbe06b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = 1;
+		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+			goto want_affine;
+		if (wake_wide(p))
+			goto want_affine;
+	}
 
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
@@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 	}
 
-	if (affine_sd) {
+want_affine:
+	if (want_affine) {
 		sd = NULL; /* Prefer wake_affine over balance flags */
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (affine_sd && wake_affine(affine_sd, p, sync))
 			new_cpu = cpu;
-	}
-
-	if (!sd) {
-		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
-
+		new_cpu = select_idle_sibling(p, new_cpu);
 	} else while (sd) {
 		struct sched_group *group;
 		int weight;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/