Message-ID: <20160510022744.GT16093@intel.com>
Date:	Tue, 10 May 2016 10:27:44 +0800
From:	Yuyang Du <yuyang.du@...el.com>
To:	Morten Rasmussen <morten.rasmussen@....com>
Cc:	peterz@...radead.org, mingo@...nel.org,
	linux-kernel@...r.kernel.org, bsegall@...gle.com, pjt@...gle.com,
	vincent.guittot@...aro.org, dietmar.eggemann@....com,
	juri.lelli@....com
Subject: Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()

On Tue, May 10, 2016 at 09:46:03AM +0100, Morten Rasmussen wrote:
> On Wed, May 04, 2016 at 04:02:46AM +0800, Yuyang Du wrote:
> > __update_sched_avg() has these steps:
> > 
> >   1. add the remainder of the last incomplete period
> >   2. decay old sum
> >   3. accumulate new sum in full periods since last_update_time
> >   4. add the current incomplete period
> >   5. update averages
> > 
> > Previously, we computed steps 1, 3, and 4 separately, which made each
> > of them ugly in code and costly in overhead.
> > 
> > For example:
> > 
> >              c1          c3           c4
> >              ^           ^            ^
> >              |           |            |
> >            |<->|<----------------->|<--->|
> >    ... |---x---|------| ... |------|-----x (now)
> > 
> > c1, c3, and c4 are the accumulated (and meanwhile decayed) contributions,
> > in terms of time, of steps 1, 3, and 4 respectively.
> > 
> > Then the accumulated contribution to load_sum, for example, is:
> > 
> > contrib = c1 * weight * freq_scaled;
> > contrib += c3 * weight * freq_scaled;
> > contrib += c4 * weight * freq_scaled;
> > 
> > Obviously, we can optimize the above as:
> > 
> > contrib = c1 + c3 + c4;
> > contrib *= weight * freq_scaled;
> 
> This isn't obvious to me. After spending quite a bit of time figuring
> out what your code actually does, there is more to it than what you describe
> here. As your label names suggest, you don't consider what happens in
> step 2 where contrib is decayed. When and how the individual bits are
> decayed is essential information.
> 
> Your patch partially moves step 2 (on your list above) before step 1.
> So it becomes:
> 
> a. decay old sum
> b. compute the contribution up to the first 1ms boundary (c1)
> c. decay c1 to get c1'
> d. accumulate the full periods (c3) adding them to c1'
> e. add remaining time up until now (c4) to contrib (c3+c1').
> f. scale by weight and arch_scale_{freq,cpu}_capacity() functions and
>    add to sum.
> 
> The basic difference in the computation is that you consolidate the
> scaling into one operation instead of three at the cost of decaying
> twice instead of once. The net result is saving a few multiplications.
> 
> I would recommend that this is made very clear. Deriving the math from
> the code is a daunting task for most people.
 
Agreed.
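
To make the order of operations concrete, below is a standalone, simplified
sketch of steps a-f (userspace C with floating-point decay; the names, the
plain loop over full periods, and the scaling are illustrative, not the
kernel's exact fixed-point code):

#include <math.h>
#include <stdio.h>

#define PERIOD   1024u   /* one period is 1024us (~1ms) */
#define HALFLIFE 32u     /* y^HALFLIFE == 0.5 */

/* val * y^n */
static unsigned long decay(unsigned long val, unsigned int n)
{
	return (unsigned long)(val * pow(0.5, (double)n / HALFLIFE));
}

/*
 * sum:            old *_sum
 * period_contrib: us already accounted for in the period that was
 *                 current at the last update
 * periods:        period boundaries crossed since the last update
 * remainder:      us into the now-current period
 */
static unsigned long
accumulate(unsigned long sum, unsigned int period_contrib,
	   unsigned int periods, unsigned int remainder,
	   unsigned long weight, unsigned long scale_freq)
{
	unsigned long contrib;

	/* a. decay the old sum */
	sum = decay(sum, periods);

	if (!periods) {
		/* no boundary crossed: only the newly elapsed time counts */
		contrib = remainder - period_contrib;
	} else {
		/* b+c. c1: the rest of the old period, decayed by 'periods' */
		contrib = decay(PERIOD - period_contrib, periods);
		/* d. c3: the periods - 1 full periods in between */
		for (unsigned int i = 1; i < periods; i++)
			contrib += decay(PERIOD, i);
		/* e. c4: the time into the current period */
		contrib += remainder;
	}

	/* f. scale the combined contribution once instead of three times */
	return sum + (unsigned long)((unsigned long long)contrib *
				     weight * scale_freq / PERIOD);
}

int main(void)
{
	/* 5 boundaries crossed, 200us already counted in the old period,
	 * 300us into the current one, weight 1024, full capacity (1024) */
	printf("%lu\n", accumulate(20000, 200, 5, 300, 1024, 1024));
	return 0;
}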

> An alternative optimization strategy would be to leave c1 as it is and
> thereby avoid decaying twice, and only add up c3 and c4 before scaling,
> reducing the number of scaling operations from three to two.

I did struggle a lot with this, balancing code simplicity against overhead.
In the end I favored simplicity, and I believe it is still an overhead win.
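
For reference, this is how I see the trade-off between the orderings,
counting only the dominant operations (schematic, not the actual code):

	current:      scale(c1); decay(sum + c1); scale(c3); scale(c4)  /* 1 decay, 3 scalings */
	patched:      decay(sum); decay(c1); scale(c1' + c3 + c4)       /* 2 decays, 1 scaling */
	alternative:  scale(c1); decay(sum + c1); scale(c3 + c4)        /* 1 decay, 2 scalings */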
 
> > 
> > After we combine them, we will have much cleaner code and fewer
> > CPU cycles.
> 
> Could you share any numbers backing up that claim?
> 
> I did a couple of runs on arm64 (Juno):
> 
> perf bench sched messaging -g 50 -l 2500' (10 runs):
> 
> tip:	65.545996712 seconds time elapsed	( +-  0.22% )
> patch:	65.209931026 seconds time elapsed	( +-  0.23% )
> 
> Taking the error margins into account, the difference there is still an
> improvement, but it is about as small as it can be without getting lost
> in the noise. Is the picture any better on x86?
>
> Whether the code is cleaner is a subjective opinion. The diffstat below
> doesn't really show any benefit, but I think you have slightly more
> comments so I would not judge based on that.
> 
> When it comes to code structure, the new __update_sched_avg() is a lot
> simpler than __update_load_avg(), but that is only due to the fact that
> most of the content has been moved to accumulate_sum() and
> __accumulate_sum(). Where we before had all the code in a single function
> that fitted on screen in one go, you now have to consider three
> functions and how they work together to figure out what is really going
> on.
> 
> I haven't found any bugs in the code, but IMHO I don't really see the
> point in rewriting the code completely. The result isn't significantly
> simpler than what we have and generates code churn affecting everyone
> working with this code. I think we can improve the existing code more by
> just factoring out the capacity/weight scaling, which would be much less
> intrusive.
> 
> Maybe I'm missing something, but I don't see a strong argument for this
> refactoring.

Do you have a criterion for how much improvement in performance and in the
code merits a patch?

To me, it sounds like you want it to be significant; or is it something else?
 
> > Signed-off-by: Yuyang Du <yuyang.du@...el.com>
> > ---
> >  kernel/sched/fair.c |  189 ++++++++++++++++++++++++++-------------------------
> >  1 file changed, 95 insertions(+), 94 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a060ef2..495e5f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
> >   */
> >  #define SCHED_AVG_HALFLIFE 32	/* number of periods as a half-life */
> >  #define SCHED_AVG_MAX 47742	/* maximum possible sched avg */
> > -#define SCHED_AVG_MAX_N 345	/* number of full periods to produce SCHED_AVG_MAX */
> > +#define SCHED_AVG_MAX_N 347	/* number of full periods to produce SCHED_AVG_MAX */
> 
> Why +2? It doesn't seem related to anything in this patch or explained
> anywhere.
 
I really should have had a separate patch to explain this. Sorry.

I provided a new program to compute SCHED_AVG_MAX_N (or LOAD_AVG_MAX_N); this
is the only difference from Paul's numbers, and it definitely deserves a
thorough discussion. I will do that in the next version.
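
For that discussion, the idea is roughly as follows (a quick userspace
sketch, not the actual program; the exact SCHED_AVG_MAX and SCHED_AVG_MAX_N
values depend on how the kernel's fixed-point decay truncates):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32);	/* y^32 == 0.5 */
	unsigned long sum = 0;
	unsigned int n = 0;

	/*
	 * Add one full 1024us period per iteration, decaying the previous
	 * sum, until the truncated sum stops growing: the fixed point is
	 * the maximum possible sum, and n is the number of full periods
	 * needed to reach it.
	 */
	for (;;) {
		unsigned long next = (unsigned long)(sum * y) + 1024;

		if (next == sum)
			break;
		sum = next;
		n++;
	}
	printf("max sum ~= %lu, reached after ~%u periods\n", sum, n);
	return 0;
}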

> >  
> >  /* Give new sched_entity start runnable values to heavy its load in infant time */
> >  void init_entity_runnable_average(struct sched_entity *se)
> > @@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
> >  
> >  /*
> >   * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> > - * lower integers.
> > + * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
> >   */
> >  static const u32 __accumulated_sum_N32[] = {
> >  	    0, 23371, 35056, 40899, 43820, 45281,
> > @@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
> >  /*
> >   * val * y^n, where y^m ~= 0.5
> >   *
> > - * n is the number of periods past; a period is ~1ms
> > + * n is the number of periods past. A period is ~1ms, so a 32bit
> > + * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
> > + *
> >   * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> > + *
> >   */
> >  static __always_inline u64 __decay_sum(u64 val, u32 n)
> >  {
> 
> The above two hunks seem to belong to some of the previous patches in
> this patch set?

Duh, yes...
 
> > @@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
> >   * We can compute this efficiently by combining:
> >   * y^32 = 1/2 with precomputed \Sum 1024*y^n   (where n < 32)
> >   */
> > -static __always_inline u32 __accumulate_sum(u32 n)
> > +static __always_inline u32
> > +__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
> >  {
> > -	u32 contrib = 0;
> > +	u32 contrib;
> > +
> > +	if (!periods)
> > +		return remainder - period_contrib;
> >  
> > -	if (likely(n <= SCHED_AVG_HALFLIFE))
> > -		return __accumulated_sum_N[n];
> > -	else if (unlikely(n >= SCHED_AVG_MAX_N))
> > +	if (unlikely(periods >= SCHED_AVG_MAX_N))
> >  		return SCHED_AVG_MAX;
> >  
> > -	/* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
> > -	contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
> > -	n %= SCHED_AVG_HALFLIFE;
> > -	contrib = __decay_sum(contrib, n);
> > -	return contrib + __accumulated_sum_N[n];
> > +	remainder += __decay_sum((u64)(1024 - period_contrib), periods);
> > +
> > +	periods -= 1;
> > +	if (likely(periods <= SCHED_AVG_HALFLIFE))
> > +		contrib = __accumulated_sum_N[periods];
> > +	else {
> > +		contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
> > +		periods %= SCHED_AVG_HALFLIFE;
> > +		contrib = __decay_sum(contrib, periods);
> > +		contrib += __accumulated_sum_N[periods];
> > +	}
> > +
> > +	return contrib + remainder;
> >  }
> >  
> >  #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> > @@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >  
> >  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> >  
> > +static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
> > +	struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
> 
> The unit and meaning of 'delta' confused me a lot when reviewing this
> patch. Here it is the time delta since last update in [us] (duration of
> c1+c3+c4).
> 
> > +{
> > +	u32 contrib, periods;
> > +	unsigned long scale_freq, scale_cpu;
> > +
> > +	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +
> > +	delta += sa->period_contrib;
> 
> Here it is extended to include time previous 1ms boundary.
> 
> > +	periods = delta >> 10; /* A period is 1024us (~1ms) */
> > +
> > +	/*
> > +	 * Accumulating *_sum has two steps.
> > +	 *
> > +	 * Step 1: decay old *_sum if we crossed period boundaries.
> > +	 */
> > +	if (periods) {
> > +		sa->load_sum = __decay_sum(sa->load_sum, periods);
> > +		if (cfs_rq) {
> > +			cfs_rq->runnable_load_sum =
> > +				__decay_sum(cfs_rq->runnable_load_sum, periods);
> > +		}
> > +		sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> > +	}
> > +
> > +	/*
> > +	 * Step 2: accumulate new *_sum since last_update_time. This at most has
> > +	 * three parts (at least one part): (1) remainder of incomplete last
> > +	 * period, (2) full periods since (1), and (3) incomplete current period.
> > +	 *
> > +	 * Fortunately, we can (and should) do all these three at once.
> > +	 */
> > +	delta %= 1024;
> 
> Now 'delta' is any leftover contribution to the current 1ms period
> (duration of c4).
> 
> > +	contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> > +	sa->period_contrib = delta;
> > +
> > +	contrib = cap_scale(contrib, scale_freq);
> > +	if (weight) {
> > +		sa->load_sum += weight * contrib;
> > +		if (cfs_rq)
> > +			cfs_rq->runnable_load_sum += weight * contrib;
> > +	}
> > +	if (running)
> > +		sa->util_sum += contrib * scale_cpu;
> > +
> > +	return periods;
> > +}
> > +
> >  /*
> >   * We can represent the historical contribution to sched average as the
> >   * coefficients of a geometric series.  To do this we divide the history
> > @@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
> >   */
> >  static __always_inline int
> >  __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> > -		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > +		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> >  {
> > -	u64 delta, scaled_delta;
> > -	u32 contrib, periods;
> > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > -	unsigned long scale_freq, scale_cpu;
> > +	u64 delta;
> >  
> >  	delta = now - sa->last_update_time;
> 
> 'delta' is the true delta, but in [ns] here.
> 
> I would suggest clearly specifying what each parameter is and its unit
> for each of the helper functions to help people a bit.

Agreed, and I reused the variable.
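
For example, something along these lines on top of accumulate_sum() (just a
draft, with the units as you describe them):

/*
 * accumulate_sum() - accumulate *_sum since last_update_time
 *
 * @delta:   time since the last update, in [us] (covers c1 + c3 + c4);
 *           note it is reused: after adding sa->period_contrib and
 *           taking delta %= 1024, it holds the contribution to the
 *           current, incomplete period (c4)
 * @sa:      the sched_avg to update
 * @cfs_rq:  if non-NULL, its runnable_load_sum is updated as well
 * @cpu:     the cpu used for arch_scale_{freq,cpu}_capacity()
 * @weight:  load weight by which the runnable contribution is scaled
 * @running: whether the entity is running, i.e. whether util_sum
 *           accrues the contribution
 *
 * Returns the number of period boundaries crossed since the last update.
 */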
