linux-kernel - Re: [PATCH 3/7] sched/fair: Correct unit of load_above

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160503145641.GA759@e105550-lin.cambridge.arm.com>
Date:	Tue, 3 May 2016 15:56:43 +0100
From:	Morten Rasmussen <morten.rasmussen@....com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Dietmar Eggemann <dietmar.eggemann@....com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/7] sched/fair: Correct unit of load_above_capacity

On Tue, May 03, 2016 at 12:52:33PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> > From: Morten Rasmussen <morten.rasmussen@....com>
> > 
> > In calculate_imbalance() load_above_capacity currently has the unit
> > [load] while it is used as being [load/capacity]. Not only is it wrong it
> > also makes it unlikely that load_above_capacity is ever used as the
> > subsequent code picks the smaller of load_above_capacity and the avg_load
> > difference between the local and the busiest sched_groups.
> > 
> > This patch ensures that load_above_capacity has the right unit
> > [load/capacity].
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@....com>
> > ---
> >  kernel/sched/fair.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bc19fee94f39..c066574cff04 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  	    local->group_type   == group_overloaded) {
> >  		load_above_capacity = busiest->sum_nr_running *
> >  					SCHED_LOAD_SCALE;
> 
> Given smpnice and cgroups; this starting point seems somewhat random.
> Then again, without cgroup and assuming all tasks are nice-0 it is still
> true. But I think the cgroup muck is rather universal these days.

It makes sense for non-grouped always-running nice-0 tasks, and not so
much for all other scenarios. We ignore smpnice, PELT, and cgroups :-(

> 
> The above SCHED_LOAD_SCALE is the 1 -> load metric, right?

Yes. For clarity I would probably have used NICE_O_LOAD instead. We are
establishing how much capacity sum_nr_running task would consume if they
are all non-grouped always-running nice-0 tasks.

> 
> > -		if (load_above_capacity > busiest->group_capacity)
> > +		if (load_above_capacity > busiest->group_capacity) {
> >  			load_above_capacity -= busiest->group_capacity;

If they would consume more than we have, we return how much more
otherwise LONG_MAX.

> > -		else
> > +			load_above_capacity *= SCHED_LOAD_SCALE;
> > +			load_above_capacity /= busiest->group_capacity;
> 
> And this one does load->capacity

It does load->[load/capacity]

The subsequent code does [load/capacity]->load:

	env->imbalance = min(
-->		max_pull * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;

> 
> > +		} else
> >  			load_above_capacity = ~0UL;
> >  	}
> 
> 
> Is there anything better we can do? I know there's a fair bit of cruft
> in; but these assumptions that SCHED_LOAD_SCALE is one task, just bug
> the hell out of me.
>
> Would it make sense to make load_balance()->detach_tasks() ensure
> busiest->sum_nr_running >= busiest->group_weight or so?

That would make a lot of sense, because that is essentially what the
code does for non-SMT systems. The story is a bit different for SMT
though since:

	busiest->group_capacity < busiest->group_weight * SCHED_LOAD_SCALE.

This could lead us to try pulling more tasks to vacate SMT hw-threads.
However, I'm not convinced if it has any effect as:

	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

is the _minimum_ of the actual load difference and the 'excess' load, so
load_above_capacity can't cause us to pull more load anyway. So I don't
see why it shouldn't be okay.

Something like the below should implement your proposal I think (only
boot tested on arm64):

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..37fcbec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5777,6 +5777,7 @@ struct lb_env {
 	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
+	long			imbalance_tasks;
 	/* The set of CPUs under consideration for load-balancing */
 	struct cpumask		*cpus;
 
@@ -6018,7 +6019,7 @@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (env->imbalance <= 0)
+	if (env->imbalance <= 0 || env->imbalance_tasks == 0)
 		return 0;
 
 	while (!list_empty(tasks)) {
@@ -6072,9 +6073,9 @@ static int detach_tasks(struct lb_env *env)
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * weighted load.
+		 * weighted load or max number of tasks.
 		 */
-		if (env->imbalance <= 0)
+		if (env->imbalance <= 0 || env->imbalance_tasks <= detached)
 			break;
 
 		continue;
@@ -6873,12 +6874,14 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
+	unsigned long max_pull;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	env->imbalance_tasks = ~0UL;
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -6904,23 +6907,17 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
-		else
-			load_above_capacity = ~0UL;
+		if (busiest->sum_nr_running > busiest->group_weight)
+			env->imbalance_tasks =
+				busiest->sum_nr_running - busiest->group_weight;
 	}
 
 	/*
 	 * We're trying to get all the cpus to the average_load, so we don't
 	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded cpu below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group capacity
-	 * (so that we can implement power-savings policies etc). Thus we look
-	 * for the minimum possible imbalance.
+	 * reduce the max loaded cpu below the average load.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
+	max_pull = busiest->avg_load - sds->avg_load;
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
-- 
1.9.1