Message-ID: <56096191.7060401@odin.com>
Date: Mon, 28 Sep 2015 18:49:37 +0300
From: Kirill Tkhai <ktkhai@...n.com>
To: Mike Galbraith <umgwanakikbuti@...il.com>
CC: <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
On 28.09.2015 18:36, Kirill Tkhai wrote:
> On 28.09.2015 16:12, Mike Galbraith wrote:
>> On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:
>>
>>> Looks like NAK may be better, because it preserves L1 cache, while the patch always invalidates it.
>>
>> Yeah, bounce hurts more when there's no concurrency win waiting to be
>> collected. This mixed load wasn't a great choice, but it turned out to
>> be pretty interesting. Something waking a gaggle of waiters on a busy
>> big socket should do very bad things.
>>
>>> Could you say whether you run pgbench with just -cX -jY -T30, or something more specific? I've tried it,
>>> but the dispersion of the results varies a lot from run to run.
>>
>> pgbench -T $testtime -j 1 -S -c $clients
>
> With -S the results stabilized. It looks like my db is enormous, and there is some problem with that. I will
> investigate.
>
> Thanks!
>
>>>> Ok, that's what I want to see, full repeat.
>>>> master = twiddle
>>>> master+ = twiddle+patch
>>>>
>>>> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>>>> master master+
>>>> pgbench 1 2 3 avg 1 2 3 avg comp
>>>> clients 1 tps = 18599 18627 18532 18586 17480 17682 17606 17589 .946
>>>> clients 2 tps = 32344 32313 32408 32355 25167 26140 23730 25012 .773
>>>> clients 4 tps = 52593 51390 51095 51692 22983 23046 22427 22818 .441
>>>> clients 8 tps = 70354 69583 70107 70014 66924 66672 69310 67635 .966
>>>>
>>>> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
>>>>
>>>> master master+
>>>> tbench 1 2 3 avg 1 2 3 avg comp
>>>> pairs 1 MB/s = 430 426 436 430 481 481 494 485 1.127
>>>> pairs 2 MB/s = 1083 1085 1072 1080 1086 1090 1083 1086 1.005
>>>> pairs 4 MB/s = 1725 1697 1729 1717 2023 2002 2006 2010 1.170
>>>> pairs 8 MB/s = 2740 2631 2700 2690 3016 2977 3071 3021 1.123
>>>>
>>>> tbench without competition
>>>> master master+ comp
>>>> pairs 1 MB/s = 694 692 .997
>>>> pairs 2 MB/s = 1268 1259 .992
>>>> pairs 4 MB/s = 2210 2165 .979
>>>> pairs 8 MB/s = 3586 3526 .983 (yawn, all within routine variance)
>>>
>>> Hm, it seems tbench with competition is better only because a busy system makes tbench
>>> processes get woken on the same cpu.
>>
>> Yeah. When box is really full, select_idle_sibling() (obviously) turns
>> into a waste of cycles, but even as you approach that, especially when
>> filling the box with identical copies of nearly fully synchronous high
>> frequency localhost packet blasters, stacking is a win.
>>
>> What bent my head up a bit was the combined effect of making wake_wide()
>> really keep pgbench from collapsing then adding the affine wakeup grant
>> for tbench. It's not at all clear to me why 2,4 would be so demolished.
>
> Mike, one more moment. wake_wide() and the current logic confuse me a bit.
> They make us decide whether we want an affine wakeup or not, but select_idle_sibling()
> is not a function for choosing this_cpu's llc domain only. We use it
> for searching prev_cpu's llc domain too, and in that case we are not interested
> in the current wakeup flips. Imagine a situation where we share a mutex
> with a task on another NUMA node. When that task releases the mutex
> it wakes us, but we definitely won't use the affine logic in this case.
> We wake the wakee anywhere and lose the hot cache. I changed the logic and
> tried pgbench 1:8. The results (I threw away the first 3 iterations, because
> they differ a lot from iterations >= 4; the reason looks to be uncached disk IO).
>
>
> Before:
>
> trans. | tps (i) | tps (e)
> --------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
>
> After:
>
> trans. | tps (i) | tps (e)
> --------------------------------------
> 12770407 | 63849.883578 | 63860.310019
> 12635366 | 63176.399769 | 63187.152569
> 12676890 | 63384.396440 | 63400.930755
> 12639949 | 63199.526330 | 63210.460753
> 12670626 | 63353.079951 | 63363.274143
> 12647001 | 63209.613698 | 63219.812331
All of the above is pgbench -j 1 -S -c 8 -T 200.
> I'm going to test other cases, but could you tell me (if you remember) whether there are reasons
> we skip prev_cpu, as I described above? Some types of workloads, etc.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> - if (sd_flag & SD_BALANCE_WAKE)
> - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> + if (sd_flag & SD_BALANCE_WAKE) {
> + want_affine = 1;
> + if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> + goto want_affine;
> + if (wake_wide(p))
> + goto want_affine;
> + }
>
> rcu_read_lock();
> for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> break;
> }
>
> - if (affine_sd) {
> +want_affine:
> + if (want_affine) {
> sd = NULL; /* Prefer wake_affine over balance flags */
> - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + if (affine_sd && wake_affine(affine_sd, p, sync))
> new_cpu = cpu;
> - }
> -
> - if (!sd) {
> - if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> - new_cpu = select_idle_sibling(p, new_cpu);
> -
> + new_cpu = select_idle_sibling(p, new_cpu);
> } else while (sd) {
> struct sched_group *group;
> int weight;
>
Regards,
Kirill