Message-ID: <56091651.6070607@odin.com>
Date:	Mon, 28 Sep 2015 13:28:33 +0300
From:	Kirill Tkhai <ktkhai@...n.com>
To:	Mike Galbraith <umgwanakikbuti@...il.com>
CC:	<linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings


On 26.09.2015 18:25, Mike Galbraith wrote:
> On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
>> We are not interested in actual target if both prev
>> and curr cpus share CPU cache. select_idle_sibling()
>> searches in top-down order; top level is the same
>> for both of them, and the result will be the same.
>> So, we can save a little CPU cycles and cache misses
>> and skip wake_affine() calculations.
> 
> But, whereas previously wake_affine() could NAK a migration if it would
> create an imbalance, we'll now just go ahead and stack tasks if
> select_idle_sibling() can't find an idle home to override the blanket
> approval.  It doesn't look like a good idea to me to bounce tasks around
> only to then perhaps stack them, as if we do stack waker/wakee, we
> certainly lose concurrency. (microbenchmarks like pipe-test love that,
> but not all that many real applications play ping-pong for a living;)
> 
> I spent most of the day piddling with your little patch, so I'll post
> some condensed mixed load notes.
> 
> concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
>                                              master                           master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
> clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
> clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
> clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!

Yeah, this is terrible.

> Do the opposite, wake_affine() always NAKs.
>                                              master                           master++
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
> clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
> clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
> clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

It looks like NAK may be better because it keeps the L1 cache warm, while the patch always invalidates it.

Could you tell me whether you run pgbench with just -cX -jY -T30, or with something special? I've tried it,
but the spread of the results varies a lot from run to run.
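
To put a number on that run-to-run spread, here is a minimal userspace sketch (not part of any patch). The helper names mean() and rel_spread() are made up for illustration, and (max - min) / mean is just one crude choice of spread metric:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n tps samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Crude run-to-run spread: (max - min) / mean. */
static double rel_spread(const double *v, size_t n)
{
	double lo, hi;
	size_t i;

	assert(n > 0);
	lo = hi = v[0];
	for (i = 1; i < n; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	return (hi - lo) / mean(v, n);
}
```

Fed the 30-second master runs at 8 clients (60332, 67052, 64699), this gives a spread of roughly 10% of the mean, versus under 3% for the 1-client runs.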

> 
> ...
> 
> virgin vs your patch again, 2 _minutes_ per client count, as I noticed much variance at 8
> clients, where wake_wide() is supposed to kick in to keep N:M load spread out.
> 
>                                              master                           master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
> clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
> clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
> clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960
> 
> 30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.
> 
> hrmph, twiddle...
> 
> waker/wakee coupling strengthened
> postgres@...er:~> pgbench.sh
> clients 1       tps = 18035
> clients 2       tps = 32525
> clients 4       tps = 53246
> clients 8       tps = 37278
> 
> better, but not enough..  + sd_llc_size = #cores vs #threads
> postgres@...er:~> pgbench.sh
> clients 1       tps = 18482
> clients 2       tps = 32366
> clients 4       tps = 54557
> clients 8       tps = 69643
> 
> Ok, that's what I want to see, full repeat.
> master = twiddle
> master+ = twiddle+patch
> 
> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>                                              master                           master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
> clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
> clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
> clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
> 
> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
> 
>                                              master                           master+
> tbench                    1       2       3     avg         1       2       3     avg   comp
> pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
> pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
> pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
> pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
> 
> tbench without competition
>                      master master+    comp
> pairs 1        MB/s =   694     692    .997 
> pairs 2        MB/s =  1268    1259    .992
> pairs 4        MB/s =  2210    2165    .979
> pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

Hm, it seems tbench does better with competition only because a busy system makes the tbench
processes wake up on the same cpu.
 
> twiddle:
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
>  {
>  	struct sched_domain *sd;
>  	struct sched_domain *busy_sd = NULL;
> +	struct sched_group *group;
>  	int id = cpu;
>  	int size = 1;
>  
>  	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>  	if (sd) {
>  		id = cpumask_first(sched_domain_span(sd));
> -		size = cpumask_weight(sched_domain_span(sd));
>  		busy_sd = sd->parent; /* sd_busy */
> +		group = sd->groups;
> +		/* Set size to the number of cores, not threads */
> +		while (group = group->next, group != sd->groups)
> +			size++;
>  	}
>  	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
>  
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
>  
>  static void record_wakee(struct task_struct *p)
>  {
> +	unsigned long now = jiffies;
> +
>  	/*
>  	 * Rough decay (wiping) for cost saving, don't worry
>  	 * about the boundary, really active task won't care
>  	 * about the loss.
>  	 */
> -	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
> +	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
>  		current->wakee_flips >>= 1;
> -		current->wakee_flip_decay_ts = jiffies;
> +		current->wakee_flip_decay_ts = now;
> +	}
> +	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
> +		p->wakee_flips >>= 1;
> +		p->wakee_flip_decay_ts = now;
>  	}
>  
>  	if (current->last_wakee != p) {
>  		current->last_wakee = p;
>  		current->wakee_flips++;
> +		p->wakee_flips++;
>  	}
>  }
>  
> 
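
The sd_llc_size hunk above walks the circular list of sched groups instead of taking cpumask_weight(). Below is a minimal userspace sketch of that walk. struct group and count_groups() are stand-ins invented for illustration, not the kernel's struct sched_group, and it assumes each group of the LLC domain corresponds to one core, as the comment in the hunk says:

```c
#include <stddef.h>

/* Stand-in for a sched group: only the circular link matters here. */
struct group {
	struct group *next;	/* last group points back to the first */
};

/*
 * Count the groups on a circular list. The list is never empty, so
 * size starts at 1 for 'first' itself, and the comma expression
 * advances before comparing against the starting point.
 */
static int count_groups(const struct group *first)
{
	const struct group *g = first;
	int size = 1;

	while (g = g->next, g != first)
		size++;
	return size;
}
```

On a ring of four groups this returns 4, whatever the number of SMT threads inside each group.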

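The record_wakee() hunk makes the flip accounting symmetric: both waker and wakee get decayed and bumped. Here is a minimal userspace model of that logic, only a sketch. struct task, time_after_ul(), decay_flips() and record_wakee_model() are illustrative names, with "now" and "hz" passed in explicitly in place of jiffies and HZ:

```c
#include <stddef.h>

/* Only the fields the hunk touches. */
struct task {
	struct task *last_wakee;
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
};

/* Wraparound-safe "a is after b", same idea as the kernel's time_after(). */
static int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Halve the flip count once more than hz ticks have passed. */
static void decay_flips(struct task *t, unsigned long now, unsigned long hz)
{
	if (time_after_ul(now, t->wakee_flip_decay_ts + hz)) {
		t->wakee_flips >>= 1;
		t->wakee_flip_decay_ts = now;
	}
}

/* Waker switches partner: count the flip on both sides. */
static void record_wakee_model(struct task *waker, struct task *wakee,
			       unsigned long now, unsigned long hz)
{
	decay_flips(waker, now, hz);
	decay_flips(wakee, now, hz);

	if (waker->last_wakee != wakee) {
		waker->last_wakee = wakee;
		waker->wakee_flips++;
		wakee->wakee_flips++;
	}
}
```

With this model a task that is woken by many different wakers accumulates flips itself, rather than only the wakers doing so.
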
Regards,
Kirill