linux-kernel - Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160412115645.GA3730@vingu-laptop>
Date:	Tue, 12 Apr 2016 13:56:45 +0200
From:	Vincent Guittot <vincent.guittot@...aro.org>
To:	Yuyang Du <yuyang.du@...el.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	Benjamin Segall <bsegall@...gle.com>,
	Paul Turner <pjt@...gle.com>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Juri Lelli <juri.lelli@....com>
Subject: Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when
 sched averages accrue

Le Tuesday 12 Apr 2016 à 03:41:41 (+0800), Yuyang Du a écrit :
> Hi Vincent,
> 
> On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > >  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > >                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > >  {
> > > -       u64 delta, scaled_delta, periods;
> > > -       u32 contrib;
> > > -       unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > +       u64 delta;
> > > +       u32 contrib, periods;
> > >         unsigned long scale_freq, scale_cpu;
> > >
> > > +       /*
> > > +        * now rolls down to a period boundary
> > > +        */
> > > +       now = now && (u64)(~0xFFFFF);
> > >         delta = now - sa->last_update_time;
> > >         /*
> > >          * This should only happen when time goes backwards, which it
> > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > >         }
> > >
> > >         /*
> > > -        * Use 1024ns as the unit of measurement since it's a reasonable
> > > -        * approximation of 1us and fast to compute.
> > > +        * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > >          */
> > > -       delta >>= 10;
> > > -       if (!delta)
> > > +       periods = delta >> 20;
> > > +       if (!periods)
> > >                 return 0;
> > >         sa->last_update_time = now;
> > 
> > The optimization looks quite interesting but I see one potential issue
> > with migration as we will lose the part of the ongoing period that is
> > now not saved anymore. This lost part can be quite significant for a
> > short task that ping pongs between CPUs.
> 
> Yes, basically, it is we lose precision (~1ms scale in contrast with ~1us scale).

But with a HZ set to 1000 and a sched-slice in the same range, having a precision
of 1ms instead of 1us makes the precision of load tracking of short task quite
random IMHO.

you can keep recording this partial period without using it in the load tracking.
Something like below keep precision without sacrifying the optimization.

---
 kernel/sched/fair.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68273e8..b234169 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
 	struct sched_avg *sa = &se->avg;
 
 	sa->last_update_time = 0;
+	/*
+	 * sched_avg's period_contrib should be strictly less then 1024 * 1024, so
+	 * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
+	 * will definitely be updated (after enqueue).
+	 */
+	sa->period_contrib = 1023*1024;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
@@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u32 contrib, periods;
 	unsigned long scale_freq, scale_cpu;
 
-	/*
-	 * now rolls down to a period boundary
-	 */
-	now = now && (u64)(~0xFFFFF);
 	delta = now - sa->last_update_time;
 	/*
 	 * This should only happen when time goes backwards, which it
@@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	}
 
+	/* Add how much left for the current period */
+	delta += sa->period_contrib;
+
 	/*
 	 * Use 1024*1024ns as an approximation of 1ms period, pretty close.
 	 */
@@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
+	/* Get how much left for the next period */
+	sa->period_contrib = delta & (u64)(0xFFFFF);
+
 	scale_freq = arch_scale_freq_capacity(NULL, cpu);
 	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

> But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> they should even out, given the load/util updates are quite a large number of
> samples, and we do want a lot of samples for the metrics, this is the point of
> the entire average thing. Plus, as you also said, the incomplete current period
> also plays a (somewhat) negative role here.
> 
> In addition, excluding the flat hierarchical util patch, we gain quite some
> efficiency.