linux-kernel - Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171013152254.GA7393@linaro.org>
Date:   Fri, 13 Oct 2017 17:22:54 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>, Josef Bacik <josef@...icpanda.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Ben Segall <bsegall@...gle.com>,
        Yuyang Du <yuyang.du@...el.com>
Subject: Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

Hi Peter,

Le Tuesday 10 Oct 2017 à 09:44:53 (+0200), Vincent Guittot a écrit :
> On 10 October 2017 at 09:29, Peter Zijlstra <peterz@...radead.org> wrote:
> > On Mon, Oct 09, 2017 at 05:29:04PM +0200, Vincent Guittot wrote:
> >> On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> >> > On 1 September 2017 at 15:21, Peter Zijlstra <peterz@...radead.org> wrote:
> >
> >> >> +/*
> >> >> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> >> >> + * propagate its contribution. The key to this propagation is the invariant
> >> >> + * that for each group:
> >> >> + *
> >> >> + *   ge->avg == grq->avg                                               (1)
> >> >> + *
> >> >> + * _IFF_ we look at the pure running and runnable sums. Because they
> >> >> + * represent the very same entity, just at different points in the hierarchy.
> >> >
> >> > I agree for the running part because only one entity can be running
> >> > but i'm not sure for the pure runnable sum because we can have
> >> > several runnable task in a cfs_rq but only one runnable group entity
> >> > to reflect them or I misunderstand (1)
> >
> > The idea is that they (ge and grq) are the _same_ entity, just at
> > different levels in the hierarchy. If the grq is runnable, it is through
> > the ge.
> >
> > As a whole, they don't care how many runnable tasks there are.
> 
> Ok so I agree with this point that both grp and ge runnable follow the
> same trend
> 
> my point is that even if same changes apply on both grp and ge, we
> can't directly apply changes of grp's runnable on ge's runnable
> 
> >
> >> > As an example, we have 2 always running task TA and TB so their
> >> > load_sum is LOAD_AVG_MAX for each task The grq->avg.load_sum = \Sum
> >> > se->avg.load_sum = 2*LOAD_AVG_MAX But the ge->avg.load_sum will be
> >> > only LOAD_AVG_MAX
> >> >
> >> > So If we apply directly the d(TB->avg.load_sum) on the group hierachy
> >> > and on ge->avg.load_sum in particular, the latter decreases to 0
> >> > whereas it should decrease only by half
> >> >
> >> > I have been able to see this wrong behavior with a rt-app json file
> >> >
> >> > so I think that we should instead remove only
> >> >
> >> > delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum
> >>
> >> delta = se->avg.load_sum / (grq->avg.load_sum+se->avg.load_sum) *
> >> ge->avg.load_sum
> >>
> >> as the se has already been detached
> >>
> >> > We don't have grq->avg.load_sum but we can have a rough estimate with
> >> > grq->avg.load_avg/grq->weight
> >
> > Hurm, I think I see what you're saying, let me ponder this more.
> 
> The formula above works was an example for detaching but it doesn't
> apply for all use case. we need a more generic way to propagate
> runnable changes

I have studied a bit more how to improve the propagation formula and the
changes below is doing the job for the UCs that I have tested.

Unlike running, we can't directly propagate the runnable through hierarchy
when we migrate a task. Instead we must ensure that we will not
over/underestimate the impact of the migration thanks to several rules:
-ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
-ge->avg.runnable_sum can't be lower than ge->avg.running_sum (once scaled to
the same range)
-we can't directly propagate a negative delta of runnable_sum because part of
this runnable time can be "shared" with others sched_entities and stays on the
gcfs_rq. Instead, we can't estimate the new runnable_sum of the gcfs_rq with
the formula: gcfs_rq's runnable sum = gcfs_rq's load_sum / gcfs_rq's weight.
-ge->avg.runnable_sum can't increase when we detach a task.

---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec0..a063b048 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3489,33 +3489,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	long runnable_load_avg, delta_avg, load_avg;
+	s64 runnable_load_sum, delta_sum, load_sum = 0;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	/*
+	 * Get a rough estimate of gcfs_rq's runnable
+	 * This is a low guess as it assumes that tasks are equally
+	 * runnable which is not true but we can't really do better
+	 */
+	if (scale_load_down(gcfs_rq->load.weight))
+		load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+
+	/*
+	 * Propating a delta of runnable is not just adding it to ge's
+	 * runnable_sum:
+	 * - Adding a delta runnable can't make ge's runnable_sum higher than
+	 *   LOAD_AVG_MAX
+	 * - We can't directly remove a delta of runnable from
+	 *   ge's runnable_sum but we can only guest estimate what runnable
+	 *   will become thanks to few simple rules:
+	 *   - gcfs_rq's runnable is a good estimate
+	 *   - ge's runnable_sum can't increase when we remove runnable
+	 *   - runnable_sum can't be lower than running_sum
+	 */
+	if (runnable_sum >= 0) {
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, LOAD_AVG_MAX);
+	} else
+		runnable_sum = min(se->avg.load_sum, load_sum);
+
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }
 
-- 
2.7.4