Date:   Fri, 23 Sep 2016 12:58:08 +0100
From:   Matt Fleming <matt@...eblueprint.co.uk>
To:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>
Cc:     linux-kernel@...r.kernel.org,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Yuyang Du <yuyang.du@...el.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>
Subject: [PATCH] sched/fair: Do not decay new task load on first enqueue

Since commit 7dc603c9028e ("sched/fair: Fix PELT integrity for new
tasks") ::last_update_time will be set to a non-zero value in
post_init_entity_util_avg(), which leads to p->se.avg.load_avg being
decayed on enqueue before the task has even had a chance to run.

For a NICE_0 task the sequence of events leading up to this, with
example load_avg values shown, might be,

  sched_fork()
    init_entity_runnable_average()
      p->se.avg.load_avg = scale_load_down(se->load.weight);	// 1024

  wake_up_new_task()
    post_init_entity_util_avg()
      attach_entity_load_avg()
        p->se.last_update_time = cfs_rq->avg.last_update_time;

    activate_task()
      enqueue_task()
        ...
          enqueue_entity_load_avg()
            migrated = !sa->last_update_time			// false
            if (!migrated)
                    __update_load_avg()
                      p->se.avg.load_avg = 1002
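
(The 1024 -> 1002 drop corresponds to roughly one elapsed PELT period:
__update_load_avg() decays contributions geometrically so that they
halve every LOAD_AVG_PERIOD (32) periods of 1024us, i.e. by a factor
of about 0.97857 per period. Below is a minimal user-space sketch of
that arithmetic, for illustration only; the kernel uses fixed-point
lookup tables rather than floating point:

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
          /* PELT: a contribution halves after 32 periods of 1024us */
          double y = pow(0.5, 1.0 / 32.0);  /* per-period decay, ~0.97857 */
          double load = 1024.0;             /* scale_load_down(NICE_0_LOAD) */

          load *= y;                        /* ~one period between attach and enqueue */
          printf("%.0f\n", load);           /* prints 1002 */
          return 0;
  }
)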

This causes a performance regression for fork-intensive workloads like
hackbench. When balancing on fork() we can end up picking the same CPU
to enqueue on over and over. This leads to huge congestion when trying
to simultaneously wake up tasks that are all on the same runqueue, and
causes lots of migrations on wake up.

The behaviour since commit 7dc603c9028e essentially defeats the
scheduler's attempt to balance on fork(). Before that commit,
::runnable_load_avg likely had a non-zero value when the hackbench
tasks were dequeued (the fork()'d tasks immediately block reading on
pipe/socket), but now the load balancer sees the CPU as having no
runnable load.

Arguably the real problem is that balancing on fork() doesn't look at
the blocked contribution of tasks, only the runnable load, and it's
possible for the two metrics to be wildly different on a relatively
idle system.
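
As an illustration of how the two metrics can diverge (a simplified
sketch, not the kernel code; the field names roughly follow the v4.7
struct cfs_rq layout): a fork()'d hackbench task stops contributing to
->runnable_load_avg the moment it is dequeued, while its contribution
lingers in ->avg.load_avg as blocked load, which the fork-time
balancing path does not consult.

  struct cfs_rq_sketch {
          unsigned long runnable_load_avg;  /* runnable entities only */
          unsigned long load_avg;           /* runnable + blocked, ~cfs_rq->avg.load_avg */
  };

  void fork_then_block(struct cfs_rq_sketch *cfs, unsigned long task_load)
  {
          /* enqueue on fork(): both sums grow */
          cfs->runnable_load_avg += task_load;
          cfs->load_avg          += task_load;

          /* task blocks reading its pipe/socket: only the runnable sum drops */
          cfs->runnable_load_avg -= task_load;

          /*
           * Balancing on fork() looks only at the runnable load (now back
           * to 0), so the CPU looks idle and keeps attracting the next
           * fork()'d task.
           */
  }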

But it still doesn't seem quite right to update a task's load_avg
before it runs for the first time.

So only apply the decay on enqueue if ::sum_exec_runtime is non-zero,
i.e. if the task has actually run before.

Here are the results of running hackbench before 7dc603c9028e (old
behaviour), with 7dc603c9028e applied (exiting behaviour), and after
7dc603c9028e with this patch on top (new behaviour),

hackbench-process-sockets

                         4.7.0-rc5             4.7.0-rc5             4.7.0-rc5
                            before          7dc603c9028e                 after
Amean    1        0.0611 (  0.00%)      0.0693 (-13.32%)      0.0600 (  1.87%)
Amean    4        0.1777 (  0.00%)      0.1730 (  2.65%)      0.1790 ( -0.72%)
Amean    7        0.2771 (  0.00%)      0.2816 ( -1.60%)      0.2741 (  1.08%)
Amean    12       0.3851 (  0.00%)      0.4167 ( -8.20%)      0.3751 (  2.60%)

Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...nel.org>
Cc: Mike Galbraith <umgwanakikbuti@...il.com>
Cc: Yuyang Du <yuyang.du@...el.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>
Signed-off-by: Matt Fleming <matt@...eblueprint.co.uk>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8fb4d1942c14..4a2d3ff772f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3142,7 +3142,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	int migrated, decayed;
 
 	migrated = !sa->last_update_time;
-	if (!migrated) {
+	if (!migrated && se->sum_exec_runtime) {
 		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
 			se->on_rq * scale_load_down(se->load.weight),
 			cfs_rq->curr == se, NULL);
-- 
2.10.0
