Date:	Mon, 28 Sep 2015 20:22:37 +0200
From:	Mike Galbraith <umgwanakikbuti@...il.com>
To:	Kirill Tkhai <ktkhai@...n.com>
Cc:	linux-kernel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:

> Mike, one more moment. wake_wide() and the current logic confuse me a bit.
> They make us decide whether we want an affine wakeup or not, but
> select_idle_sibling() is not a function for searching this_cpu's LLC domain
> only. We use it for searching prev_cpu's LLC domain too, and it seems we are
> not interested in current's flips in that case.

We're always interested in "flips", as the point is to try to identify
N:M load components, and when they may overload a socket.  The hope is
to get it more right than wrong, as making the tracking really accurate
is too expensive for the fast path.
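
For reference, the flip tracking in question looks roughly like this
(paraphrased from kernel/sched/fair.c of this era, not a verbatim copy;
exact code differs between kernel versions):

/*
 * A "flip" is recorded whenever a task wakes a different wakee than
 * last time; the count decays by half once a second.  High flip counts
 * on both waker and wakee relative to the LLC size suggest an N:M load
 * component, so wake_wide() NAKs the affine wakeup.
 */
static void record_wakee(struct task_struct *p)
{
	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
		current->wakee_flips >>= 1;
		current->wakee_flip_decay_ts = jiffies;
	}

	if (current->last_wakee != p) {
		current->last_wakee = p;
		current->wakee_flips++;
	}
}

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = this_cpu_read(sd_llc_size); /* CPUs sharing the LLC */

	if (master < slave)
		swap(master, slave);
	if (slave < factor || master < slave * factor)
		return 0;	/* looks 1:1-ish, allow affine wakeup */
	return 1;		/* looks N:M, go wide */
}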

>  Imagine a situation where we share a mutex
> with a task on another NUMA node. When that task releases the mutex
> it wakes us, but we definitely don't want to use the affine logic in this case.

Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
its opinion, then look for an idle CPU near the waker's CPU if it says
OK, or near wakee's previous CPU if it says go away. 
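
In rough pseudocode (an illustrative sketch only; the domain walk and
error paths are elided, and this is not verbatim kernel code):

if (!wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
	/* not NAKed: ask wake_affine() for its opinion */
	if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
		new_cpu = cpu;	/* OK: pull toward the waker */
	/* then look for an idle CPU near whichever CPU won */
	new_cpu = select_idle_sibling(p, new_cpu);
}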

> We wake the wakee anywhere and lose the hot cache.

Yeah, sometimes we'll make tasks drag their data to them when we could
have dragged the task to the data in the name of trying to crank up CPU
utilization.  At some point, _somebody_ has to drag their data across
interconnect, but we really don't know if/when the data transport cost
will pay off in better utilization.

	-Mike

(I'll take a peek at the patch below when these damn futexes get done kicking my a$$)

>  I changed the logic and
> tried pgbench 1:8. The results are below (I threw away the first 3 iterations
> because they differ a lot from iterations >= 4; the reason looks to be uncached disk IO).
> 
> 
> Before:
> 
>   trans. |  tps (i)     | tps (e)
> --------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
> 
> After:
> 
>   trans. |  tps (i)     | tps (e)
> --------------------------------------
> 12770407 | 63849.883578 | 63860.310019
> 12635366 | 63176.399769 | 63187.152569
> 12676890 | 63384.396440 | 63400.930755
> 12639949 | 63199.526330 | 63210.460753
> 12670626 | 63353.079951 | 63363.274143
> 12647001 | 63209.613698 | 63219.812331
> 
> I'm going to test other cases, but could you tell me (if you remember) whether there
> are reasons we skip prev_cpu, as I described above? Some types of workloads, etc.
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>  
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}
>  
>  	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  	}
>  
> -	if (affine_sd) {
> +want_affine:
> +	if (want_affine) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>  			new_cpu = cpu;
> -	}
> -
> -	if (!sd) {
> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> -
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  	} else while (sd) {
>  		struct sched_group *group;
>  		int weight;

