linux-kernel - Re: [PATCH] sched/fair: Skip wake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 28 Sep 2015 22:19:08 +0300
From:	Kirill Tkhai <ktkhai@...n.com>
To:	Mike Galbraith <umgwanakikbuti@...il.com>
CC:	<linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings



On 28.09.2015 21:22, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
> 
>> Mike, one more moment. wake_wide() and current logic confuses me a bit.
>> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
>> if a function is not for choosing this_cpu's llc domain only. We use it
>> for searching in prev_cpu llc domain too, and it seems we are not interested
>> in current flips in this case.
> 
> We're always interested in "flips", as the point is to try to identify
> N:M load components, and when they may overload a socket.  The hope is
> to get it more right than wrong, as making the tracking really accurate
> is too expensive for the fast path.
> 
>>  Imagine a situation, when we share a mutex
>> with a task on another NUMA node. When the task is realising the mutex
>> it is waking us, but we definitelly won't use affine logic in this case.
> 
> Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
> If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> its opinion, then look for an idle CPU near the waker's CPU if it says
> OK, or near wakee's previous CPU if it says go away. 

But NUMA sd does not have SD_WAKE_AFFINE flag, so this case a new cpu won't
be choosen from previous node. There will be choosen the highest domain
of smp_processor_id(), which has SD_BALANCE_WAKE flag, and the cpu will
be choosen from the idlest group/cpu. And we don't have a deal with old
cache at all. This looks like a completely wrong behaviour...

>> We wake the wakee anywhere and loose hot cache.
> 
> Yeah, sometimes we'll make tasks drag their data to them when we could
> have dragged the task to the data in the name of trying to crank up CPU
> utilization.  At some point, _somebody_ has to drag their data across
> interconnect, but we really don't know if/when the data transport cost
> will pay off in better utilization.
> 
> 	-Mike
> 
> (I'll take a peek at below when damn futexes get done kicking my a$$)

This case, you'll be able to analyze new results below :)

ORIGIN  1               2               3               4               avg
C=1	8607,960451	8344,16111	8381,197569	7991,84102	8331,2900375
C=2	15397,152685	15438,913761	15771,512182	15648,368819	15563,98686175
C=4	30987,860844	31144,127431	31104,153461	30874,292825	31027,60864025
C=8	62447,47612	62179,788923	61534,482204	62787,894021	62237,410317
					
					
PATCHED	1               2               3               4               avg		
C=1	8782,439938	8675,891877	8609,209537	8735,120895	8700,66556175
C=2	16526,31409	16491,650678	16149,594736	16365,630084	16383,297397
C=4	32286,341708	32313,536565	32538,285157	32299,427398	32359,397707
C=8	63860,310019	63187,152569	63400,930755	63210,460753	63414,713524

This is:
# pgbench -j 1 -S -c X -T 200 test

The test machine has no NUMA, and I suppose, NUMA machine will show much better results.
I'll test it tomorrow.

>>  I changed the logic, and
>> tried pgbench 1:8. The results (I threw away 3 first iterations, because
>> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
>>
>>
>> Before:
>>
>>   trans. |  tps (i)     | tps (e)
>> --------------------------------------
>> 12098226 | 60491.067392 | 60500.886373
>> 12030184 | 60150.874285 | 60160.654295
>> 11882977 | 59414.829150 | 59424.830637
>> 12020125 | 60100.579023 | 60111.600176
>> 12161917 | 60809.547906 | 60827.321639
>> 12154660 | 60773.249254 | 60783.085165
>>
>> After:
>>
>>  trans.  | tps (i)      | tps (e)
>> --------------------------------------
>> 12770407 | 63849.883578 | 63860.310019      
>> 12635366 | 63176.399769 | 63187.152569      
>> 12676890 | 63384.396440 | 63400.930755      
>> 12639949 | 63199.526330 | 63210.460753      
>> 12670626 | 63353.079951 | 63363.274143      
>> 12647001 | 63209.613698 | 63219.812331      
>>
>> I'm going to test other cases, but could you tell me (if you remember) are there reasons
>> we skip prev_cpu, like I described above? Some types of workloads etc.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  	int want_affine = 0;
>>  	int sync = wake_flags & WF_SYNC;
>>  
>> -	if (sd_flag & SD_BALANCE_WAKE)
>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = 1;
>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> +			goto want_affine;
>> +		if (wake_wide(p))
>> +			goto want_affine;
>> +	}
>>  
>>  	rcu_read_lock();
>>  	for_each_domain(cpu, tmp) {
>> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>  			break;
>>  	}
>>  
>> -	if (affine_sd) {
>> +want_affine:
>> +	if (want_affine) {
>>  		sd = NULL; /* Prefer wake_affine over balance flags */
>> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>>  			new_cpu = cpu;
>> -	}
>> -
>> -	if (!sd) {
>> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> -			new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> +		new_cpu = select_idle_sibling(p, new_cpu);
>>  	} else while (sd) {
>>  		struct sched_group *group;
>>  		int weight;
> 
> 

Regards,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/