Message-ID: <20151130102240.GH17308@twins.programming.kicks-ass.net>
Date:	Mon, 30 Nov 2015 11:22:40 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	Waiman Long <Waiman.Long@....com>
Cc:	Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
	Scott J Norton <scott.norton@....com>,
	Douglas Hatch <doug.hatch@....com>,
	Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Yuyang Du <yuyang.du@...el.com>
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers
 and writers of load_avg


Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick. So there are usually
> more readers than writers of load_avg. Still, on a large system,
> the cacheline contention can cause significant slowdown and impact
> performance.
> 
> This patch attempts to separate those load_avg readers
> (update_cfs_shares) and writers (task_tick_fair) to use different
> cachelines instead. Writers of load_avg will now accumulate the
> load delta into load_avg_delta which sits in a different cacheline.
> If load_avg_delta is sufficiently large (> load_avg/64), it will then
> be added back to load_avg.
> 
> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
> 480 threads), the perf profile before the patch was:
> 
>    9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
> 
> After the patch, it became:
> 
>    2.94%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    2.52%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>    2.25%   0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
>    2.21%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>    1.70%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>    0.96%   0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
>    0.61%   0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares

This begs the question though; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.

That is; isn't this simply a problem we should avoid?

> The benchmark results before and after the patch were as follows:
> 
>   Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>   After patch  - Max-jOPs: 939130    Critical-jOps: 211937
> 
> There was significant improvement in Critical-jOps which was latency
> sensitive.
> 
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on 4-socket or larger systems
> which can get the most benefit from it.

So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

> Signed-off-by: Waiman Long <Waiman.Long@....com>
> ---
>  kernel/sched/core.c  |    9 +++++++++
>  kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |    8 ++++++++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>  		ptr += nr_cpu_ids * sizeof(void **);
>  
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Use load_avg_delta if not 2P or less
> +		 */
> +		root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>  	if (!alloc_rt_sched_group(tg, parent))
>  		goto err;
>  
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> +	tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
>  	return tg;
>  
>  err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>   * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read mostly whereas
> + * load_avg_delta will be write mostly.
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
>  	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>  
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> -		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		struct task_group *tg = cfs_rq->tg;
> +		long load_avg, tot_delta;
> +
> +		if (!tg->use_la_delta) {
> +			/*
> +			 * If the use_la_delta isn't set, just add the
> +			 * delta directly into load_avg.
> +			 */
> +			atomic_long_add(delta, &tg->load_avg);
> +			goto set_contrib;
> +		}
> +
> +		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> +		load_avg = atomic_long_read(&tg->load_avg);
> +		if (abs(tot_delta) > load_avg / 64) {
> +			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> +			if (tot_delta)
> +				atomic_long_add(tot_delta, &tg->load_avg);
> +		}
> +set_contrib:
>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>  	}
>  }

I'm thinking that it's now far too big to retain the inline qualifier.
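
For reference, the batching scheme in the quoted hunk can be sketched in
user space roughly as follows. This is only an illustration using C11
atomics, not the kernel code: struct tg_sketch, tg_sketch_add() and the
64-byte line size are hypothetical names and assumptions, and the
per-cfs_rq tg_load_avg_contrib bookkeeping is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two task_group fields; 64-byte lines assumed. */
struct tg_sketch {
	_Alignas(64) atomic_long load_avg;       /* read-mostly */
	_Alignas(64) atomic_long load_avg_delta; /* write-mostly accumulator */
};

/* The two counters start at least one assumed cacheline apart. */
static_assert(offsetof(struct tg_sketch, load_avg_delta) -
	      offsetof(struct tg_sketch, load_avg) >= 64,
	      "counters must not share a line");

/*
 * Accumulate delta in load_avg_delta and fold the accumulated value into
 * load_avg only once its magnitude exceeds load_avg / 64, mirroring the
 * threshold in the quoted patch. Returns 1 if a fold was attempted.
 */
static int tg_sketch_add(struct tg_sketch *tg, long delta)
{
	long tot = atomic_fetch_add(&tg->load_avg_delta, delta) + delta;
	long avg = atomic_load(&tg->load_avg);

	if (labs(tot) <= avg / 64)
		return 0;

	/* Another writer may have drained the accumulator first. */
	tot = atomic_exchange(&tg->load_avg_delta, 0);
	if (tot)
		atomic_fetch_add(&tg->load_avg, tot);
	return 1;
}
```

With load_avg at 6400, a delta of 50 stays in the accumulator; a further
delta of 60 pushes the total past 100 and the whole 110 is folded in.
Readers of load_avg see a value that can lag by up to that threshold.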

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
>  	 * load_avg can be heavily contended at clock tick time, so put
>  	 * it in its own cacheline separated from the fields above which
>  	 * will also be accessed at each tick.
> +	 *
> +	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
> +	 * to accumulate the delta and only change load_avg when the delta
> +	 * is big enough. This reduces the cacheline contention on load_avg.
> +	 * This flag will be set at allocation time depending on the system
> +	 * configuration.
>  	 */
> +	int use_la_delta;
>  	atomic_long_t load_avg ____cacheline_aligned;
> +	atomic_long_t load_avg_delta ____cacheline_aligned;

This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.

Also, you unconditionally grow the structure by a whole cacheline.
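
Both points can be illustrated with a small user-space sketch. The
struct layouts below are hypothetical, heavily reduced stand-ins for
struct task_group, and the 64-byte line size is an assumption; the
helper just does the address arithmetic for a given base address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHELINE 64	/* assumed cacheline size */

/* Reduced stand-in for the layout before the patch. */
struct tg_before {
	int other;
	_Alignas(SKETCH_CACHELINE) long load_avg;
};

/* Reduced stand-in for the layout after the patch. */
struct tg_after {
	int other;
	int use_la_delta;
	_Alignas(SKETCH_CACHELINE) long load_avg;
	_Alignas(SKETCH_CACHELINE) long load_avg_delta;
};

/* The member offset is always a multiple of the line size ... */
static_assert(offsetof(struct tg_after, load_avg) % SKETCH_CACHELINE == 0,
	      "offset is line-aligned relative to the struct start");

/*
 * ... but where the member really lands depends on the base address the
 * allocator returned. Returns how far load_avg sits past a line boundary.
 */
static size_t load_avg_misalignment(uintptr_t base)
{
	return (base + offsetof(struct tg_after, load_avg)) % SKETCH_CACHELINE;
}
```

A base of 0x1000 puts load_avg on a line boundary, while a base of
0x1008 (word-aligned only) leaves it 8 bytes past one, so the member
attribute alone does not keep the two counters on separate lines. In
this reduced layout the second aligned member also grows sizeof() by a
full 64 bytes whether or not the flag is ever set.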

>  #endif
>  #endif
>  
> -- 
> 1.7.1
> 