Message-ID: <52DFFC44.4030104@hp.com>
Date: Wed, 22 Jan 2014 12:13:40 -0500
From: Waiman Long <waiman.long@...com>
To: bsegall@...gle.com
CC: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Frederic Weisbecker <fweisbec@...il.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Serge Hallyn <serge.hallyn@...onical.com>,
Aswin Chandramouleeswaran <aswin@...com>,
Scott J Norton <scott.norton@...com>
Subject: Re: [PATCH v2] sched: reduce contention on tg's load_avg & runnable_avg
On 01/16/2014 01:21 PM, bsegall@...gle.com wrote:
> Waiman Long <Waiman.Long@...com> writes:
>
>> It was found that with a perf profile of a compute workload (at 1500
>> users) of the AIM7 benchmark running on a glueless 4-socket 40-core
>> Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
>> tick related functions account for quite a significant portion of
>> the total kernel cpu cycles.
>>
>> 0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
>> 0.47% reaim [kernel.kallsyms] [k] entity_tick
>> 0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
>> 0.03% reaim [kernel.kallsyms] [k] update_curr
>>
>> The scheduling tick functions account for about 1.22% of the total
>> CPU cycles. Of the top 2 functions in the above list, the reading
>> and writing of the tg->load_avg variable account for over 90% of the
>> CPU cycles:
>>
>> atomic_long_add(tg_contrib, &tg->load_avg);
>> atomic_long_read(&tg->load_avg) + 1);
>>
>> This patch reduces the contention on the load_avg variable (and
>> secondarily on the runnable_avg variable) by the following 2 measures:
>>
>> 1. Make the load_avg and runnable_avg fields of the task_group
>> structure sit in their own cacheline without sharing it with others.
>> This only applies if the kernel is built for NUMA systems with
>> multiple sockets.
> How much of the benefit comes from this (and how much for load_avg vs
> runnable_avg vs just one separate cache_line for the pair)?
Below is the performance data for the different cacheline placements:

Cacheline Placement   | %CPU  | JPM    |
----------------------+-------+--------+
2 separate cachelines | 0.55% | 405803 |
1 common cacheline    | 1.01% | 403462 |
2nd change only       | 1.06% | 403820 |
Original code         | 1.22% | 398509 |

It seems that forcing the 2 fields into the same cacheline actually
makes it perform slightly worse in JPM than applying the 2nd change
alone (403462 vs. 403820). It is likely that the 2 fields were already
in 2 different cachelines on x86.
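
To make the three layouts in the table concrete, here is a rough user-space sketch of "2 separate cachelines" versus "1 common cacheline". It is not the patch itself: the struct names, the `demo_` helper and the 64-byte line size are assumptions for illustration, and it uses C11 `alignas` rather than the kernel's own alignment annotations.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed cacheline size; 64 bytes is typical on x86 but not guaranteed. */
#define DEMO_CACHELINE 64

/* Hypothetical stand-in for task_group: each hot counter on its own line. */
struct demo_tg_separate {
	long shares;                                     /* read-mostly field */
	alignas(DEMO_CACHELINE) atomic_long load_avg;    /* starts a new line */
	alignas(DEMO_CACHELINE) atomic_int runnable_avg; /* starts another line */
};

/* Hypothetical variant: both hot counters forced into one common line. */
struct demo_tg_common {
	long shares;
	alignas(DEMO_CACHELINE) atomic_long load_avg;    /* starts a new line */
	atomic_int runnable_avg;                         /* packed right after it */
};

/* Index of the cacheline that a given struct offset falls into. */
static size_t demo_line_of(size_t offset)
{
	return offset / DEMO_CACHELINE;
}
```

Comparing `demo_line_of(offsetof(...))` for the two counters shows whether a write to one of them can bounce the line holding the other.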
>> 2. Use atomic_long_add_return() to update the fields and save the
>> returned value in a temporary location in the cfs structure to
>> be used later instead of reading the fields directly.
>>
> This is safe for tg->runnable_avg, as it only lasts for one line of
> __update_entity_load_avg_contrib, and is never used for rq->cfs. That
> said, given that it is such a short and contained duration it seems
> simpler to just pass it around in __update_entity_load_avg_contrib
> rather than make a new field on cfs_rq.
Thanks for the suggestion, I will look into that.
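
A rough user-space sketch of that idea, passing the value returned by the atomic add to the caller instead of parking it in a new cfs_rq field. This is only an illustration with C11 atomics: all `demo_` names are hypothetical, and only the 1/64 update threshold is taken from the patch context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical shared counter standing in for tg->runnable_avg. */
static atomic_int demo_runnable_avg;

/*
 * Add delta and return the new total in one atomic operation.
 * C11 atomic_fetch_add() returns the old value, so delta is added back.
 */
static int demo_add_return(atomic_int *v, int delta)
{
	return atomic_fetch_add(v, delta) + delta;
}

/*
 * Fold a changed per-queue contribution into the shared sum.
 * Returns true and stores the post-update total in *total when the
 * shared counter was written, false when the change was too small.
 */
static bool demo_update_runnable(int *cur_contrib, int new_contrib, int *total)
{
	int delta = new_contrib - *cur_contrib;

	if (abs(delta) > *cur_contrib / 64) {
		*total = demo_add_return(&demo_runnable_avg, delta);
		*cur_contrib += delta;
		return true;
	}
	return false;
}

/* Caller: reuse the returned total, read the counter only as a fallback. */
static int demo_runnable_total(int *cur_contrib, int new_contrib)
{
	int total;

	if (!demo_update_runnable(cur_contrib, new_contrib, &total))
		total = atomic_load(&demo_runnable_avg);
	return total;
}
```

With this shape the total never outlives the call that produced it, so no stale copy can be left behind in a per-queue field.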
>> The second change does require some changes in the ordering of how
>> some of the average counts are being computed and hence may have a
>> slight effect on their behavior.
>>
>> With these 2 changes, the perf profile becomes:
>>
>> 0.42% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
>> 0.05% reaim [kernel.kallsyms] [k] update_cfs_shares
>> 0.04% reaim [kernel.kallsyms] [k] update_curr
>> 0.04% reaim [kernel.kallsyms] [k] entity_tick
>>
>> The %CPU cycle is reduced to about 0.55%. It is not a big change,
>> but it did improve the compute benchmark slightly from 398509 JPM
>> (Jobs/Minute) to 405803 JPM which is about 2% improvement and reduced
>> the reported systime from 50.03s to 48.37s.
>>
>> Signed-off-by: Waiman Long <Waiman.Long@...com>
>> ---
>> kernel/sched/fair.c | 29 ++++++++++++++++++++++-------
>> kernel/sched/sched.h | 14 ++++++++++++--
>> 2 files changed, 34 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c7395d9..c4aa86d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1868,7 +1868,10 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>> * to gain a more accurate current total weight. See
>> * update_cfs_rq_load_contribution().
>> */
>> - tg_weight = atomic_long_read(&tg->load_avg);
>> + /* Use the saved version of tg's load_avg, if available */
>> + tg_weight = cfs_rq->tg_load_save;
>> + if (!tg_weight)
>> + tg_weight = atomic_long_read(&tg->load_avg);
>> tg_weight -= cfs_rq->tg_load_contrib;
>> tg_weight += cfs_rq->load.weight;
>>
>> @@ -2155,7 +2158,8 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>> tg_contrib -= cfs_rq->tg_load_contrib;
>>
>> if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>> - atomic_long_add(tg_contrib, &tg->load_avg);
>> + cfs_rq->tg_load_save =
>> + atomic_long_add_return(tg_contrib, &tg->load_avg);
>> cfs_rq->tg_load_contrib += tg_contrib;
>> }
>> }
>> @@ -2176,7 +2180,8 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>> contrib -= cfs_rq->tg_runnable_contrib;
>>
>> if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
>> - atomic_add(contrib, &tg->runnable_avg);
>> + cfs_rq->tg_runnable_save =
>> + atomic_add_return(contrib, &tg->runnable_avg);
>> cfs_rq->tg_runnable_contrib += contrib;
>> }
>> }
>> @@ -2186,12 +2191,19 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>> struct cfs_rq *cfs_rq = group_cfs_rq(se);
>> struct task_group *tg = cfs_rq->tg;
>> int runnable_avg;
>> + long load_avg;
>>
>> u64 contrib;
>>
>> contrib = cfs_rq->tg_load_contrib * tg->shares;
>> - se->avg.load_avg_contrib = div_u64(contrib,
>> - atomic_long_read(&tg->load_avg) + 1);
>> + /*
>> + * Retrieve & clear the saved tg's load_avg and use it if not 0
>> + */
>> + load_avg = cfs_rq->tg_load_save;
>> + cfs_rq->tg_load_save = 0;
>> + if (unlikely(!load_avg))
>> + load_avg = atomic_long_read(&tg->load_avg);
>> + se->avg.load_avg_contrib = div_u64(contrib, load_avg + 1);
>>
>> /*
>> * For group entities we need to compute a correction term in the case
>> @@ -2216,7 +2228,10 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>> * of consequential size guaranteed to see n_i*w_i quickly converge to
>> * our upper bound of 1-cpu.
>> */
>> - runnable_avg = atomic_read(&tg->runnable_avg);
>> + runnable_avg = cfs_rq->tg_runnable_save;
>> + cfs_rq->tg_runnable_save = 0;
>> + if (unlikely(!runnable_avg))
>> + runnable_avg = atomic_read(&tg->runnable_avg);
>> if (runnable_avg < NICE_0_LOAD) {
>> se->avg.load_avg_contrib *= runnable_avg;
>> se->avg.load_avg_contrib >>= NICE_0_SHIFT;
>> @@ -2823,9 +2838,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>> /*
>> * Ensure that runnable average is periodically updated.
>> */
>> - update_entity_load_avg(curr, 1);
>> update_cfs_rq_blocked_load(cfs_rq, 1);
>> update_cfs_shares(cfs_rq);
>> + update_entity_load_avg(curr, 1);
> You've confused group_cfs_rq(curr) and cfs_rq=cfs_rq_of(curr) here -
> there is no need to do this accuracy-reducing reordering.
> update_cfs_rq_blocked_load would set cfs_rq->tg_load_save, and then
> entity_tick(curr->parent) called this same tick would read this value,
> the same way enqueue/dequeue will do what you wanted.
I will try to do it without reordering calls here.
> That said, there is still a problem that tg_load_save could escape in
> cases where __update_entity_load_avg_contrib gets skipped, either via
> __update_entity_load_avg_contrib not crossing a boundary or
> enqueue/dequeue aborting early due to cfs_rq_throttled. Worst case
> should be accessing a value ~1ms old though, which might be acceptable.
I will provide a more detailed analysis of all the possible cases in
the next version of the patch.
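
For reference, the save/clear/fallback pattern under discussion can be modeled in user space roughly as below. This is a simplified sketch, not the kernel code: the `demo_` names are hypothetical, the producer and consumer are collapsed into two plain functions, and C11 atomics stand in for atomic_long_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared sum standing in for tg->load_avg. */
static atomic_long demo_tg_load_avg;

/* Hypothetical per-queue state; 0 means "no saved value". */
struct demo_cfs_rq {
	long tg_load_save;
};

/* Producer: update the shared sum and remember the resulting total. */
static void demo_publish(struct demo_cfs_rq *rq, long delta)
{
	rq->tg_load_save = atomic_fetch_add(&demo_tg_load_avg, delta) + delta;
}

/*
 * Consumer: take the saved total once, then clear it. If nothing was
 * saved (or the saved total was 0), fall back to reading the shared sum.
 */
static long demo_consume(struct demo_cfs_rq *rq)
{
	long v = rq->tg_load_save;

	rq->tg_load_save = 0;
	if (!v)
		v = atomic_load(&demo_tg_load_avg);
	return v;
}
```

If the consumer is skipped after a publish, the saved total stays in place while other CPUs keep updating the shared sum, so the next consumer call returns the older saved value before the pattern falls back to fresh reads. That is the escape case described above.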
-Longman