linux-kernel - Re: [PATCH 2/2] sched/fair: Always propagate runnable_load


Date:   Tue, 2 May 2017 15:26:12 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>, kernel-team@...com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Hi Tejun,

On Tuesday 02 May 2017 at 09:18:53 (+0200), Vincent Guittot wrote:
> On 28 April 2017 at 22:33, Tejun Heo <tj@...nel.org> wrote:
> > Hello, Vincent.
> >
> > On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> >> On 27 April 2017 at 00:52, Tejun Heo <tj@...nel.org> wrote:
> >> > Hello,
> >> >
> >> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> >> On 24 April 2017 at 22:14, Tejun Heo <tj@...nel.org> wrote:
> >> >> Can the problem be on the load balance side instead, and more
> >> >> precisely in the wakeup path?
> >> >> After looking at the trace, it seems that task placement happens in
> >> >> the wakeup path, and if it fails to select the right idle CPU at wakeup,
> >> >> you will have to wait for a load balance, which is already too late.
> >> >
> >> > Oh, I was tracing most of scheduler activities and the ratios of
> >> > wakeups picking idle CPUs were about the same regardless of cgroup
> >> > membership.  I can confidently say that the latency issue that I'm
> >> > seeing is from load balancer picking the wrong busiest CPU, which is
> >> > not to say that there can't be other problems.
> >>
> >> OK. Is there any trace that you can share? Your behavior seems
> >> different from mine.
> >
> >

[snip]

> > You can notice that B's pertask weight is 4.409, which is way higher
> > than A's 2.779, and this is because Q014-asdf's contribution to Q014-/
> > is twice as high as it should be.  The root queue's runnable avg should
> 
> Are you sure that this is because of blocked load in group A? It could
> be that Q014-asdf already had to wait before running, and its load
> keeps increasing while it is runnable but not running.
> IIUC your trace, group A has 2 running tasks and group B only one, but
> load_balance selects B because its sgs->avg_load is higher. But this
> can also happen even if the runnable_load_avg of the child cfs_rq was
> propagated correctly to the group entity: a group A can have only 1
> task with a higher load than the 2 tasks in group B, and even if
> blocked load is not taken into account, load_balance will select A.
> 
> IMHO, we should rather improve the load balance selection. I'm going
> to add smarter group selection in load_balance. That's something we
> should have done already, but it was difficult without load/util_avg
> propagation. It should be doable now.

Could you test the load_balance patch below?
If a group is not overloaded, which means that its threads get all the
runtime they want, we select the cfs_rq according to the number of running
threads rather than by load.
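
To make the difference between the two criteria concrete, here is a
minimal standalone sketch, not kernel code. struct group_stats is a
hypothetical, simplified stand-in for the scheduler's per-group statistics,
and the load values are made-up numbers chosen only to reproduce the
situation described above (one heavy task versus two lighter tasks).

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-group load-balance stats. */
struct group_stats {
	unsigned long avg_load;      /* group load average (made-up units) */
	unsigned int sum_nr_running; /* runnable tasks in the group */
	unsigned int group_weight;   /* number of CPUs in the group */
};

/* Load-based criterion: the candidate wins only with a higher avg_load. */
static bool busier_by_load(const struct group_stats *sgs,
			   const struct group_stats *busiest)
{
	return sgs->avg_load > busiest->avg_load;
}

/*
 * Count-based criterion for groups that are not overloaded: the candidate
 * wins only with more running tasks per CPU. Both sides are
 * cross-multiplied so that no integer division is needed.
 */
static bool busier_by_nr_running(const struct group_stats *sgs,
				 const struct group_stats *busiest)
{
	return (unsigned long)sgs->sum_nr_running * busiest->group_weight >
	       (unsigned long)busiest->sum_nr_running * sgs->group_weight;
}

/* Illustrative values only: two groups of 2 CPUs each. */
static const struct group_stats one_heavy_task  = { 900, 1, 2 };
static const struct group_stats two_light_tasks = { 700, 2, 2 };
```

With these assumed numbers, the load-based check prefers the group with a
single heavy task, while the count-based check prefers the group with two
tasks, which is the one where moving a task can reduce scheduling latency.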

---
 kernel/sched/fair.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..87e3b77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7069,7 +7069,8 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 enum group_type {
-	group_other = 0,
+	group_idle = 0,
+	group_other,
 	group_imbalanced,
 	group_overloaded,
 };
@@ -7383,6 +7384,9 @@ group_type group_classify(struct sched_group *group,
 	if (sgs->group_no_capacity)
 		return group_overloaded;
 
+	if (!sgs->sum_nr_running)
+		return group_idle;
+
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
@@ -7476,8 +7480,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
+	if (sgs->group_type == group_other) {
+		/*
+		 * The groups are not overloaded, so there is enough CPU time
+		 * for all threads. In this case, take the group with the
+		 * highest number of tasks per CPU in order to improve
+		 * scheduling latency.
+		 */
+		if ((sgs->sum_nr_running * busiest->group_weight) <=
+				(busiest->sum_nr_running * sgs->group_weight))
+			return false;
+	} else if (sgs->avg_load <= busiest->avg_load) {
 		return false;
+	}
 
 	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
 		goto asym_packing;
@@ -7969,6 +7984,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
+		if (!rq->cfs.h_nr_running)
+			continue;
+
 		/*
 		 * For the load comparisons with the other cpu's, consider
 		 * the weighted_cpuload() scaled with the cpu capacity, so
-- 
2.7.4
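
For readers following the group_type change, the classification order in
the patch can be sketched outside the kernel as below. This is only an
illustration under stated assumptions: struct lb_group and its boolean
fields are hypothetical simplifications of sgs->group_no_capacity and
sg_imbalanced(group), and ruled_out_by_type() models only the single type
comparison visible in the hunk context above.

```c
#include <stdbool.h>

/* Same ordering as in the patch: a higher value means a busier class. */
enum group_type {
	group_idle = 0,
	group_other,
	group_imbalanced,
	group_overloaded,
};

/* Hypothetical, simplified per-group state used only for this sketch. */
struct lb_group {
	bool no_capacity;            /* stands in for sgs->group_no_capacity */
	bool imbalanced;             /* stands in for sg_imbalanced(group) */
	unsigned int sum_nr_running; /* runnable tasks in the group */
};

/* Mirrors the order of the checks in the patched group_classify(). */
static enum group_type classify(const struct lb_group *g)
{
	if (g->no_capacity)
		return group_overloaded;
	if (!g->sum_nr_running)
		return group_idle;
	if (g->imbalanced)
		return group_imbalanced;
	return group_other;
}

/* A candidate of a lower class never replaces the current busiest group. */
static bool ruled_out_by_type(enum group_type candidate,
			      enum group_type busiest)
{
	return candidate < busiest;
}
```

Because group_idle sorts below group_other, a group with no running task
is rejected by the type comparison as soon as the current busiest group
has at least one running task.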


> 
> > only contain what's currently active but because we're scaling load
> > avg which includes both active and blocked, we're ending up picking
> > group B over A.
> >