Message-ID: <20170424201344.GA14169@wtj.duckdns.org>
Date:   Mon, 24 Apr 2017 13:13:44 -0700
From:   Tejun Heo <tj@...nel.org>
To:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>, kernel-team@...com
Subject: [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is
 in use

Hello,

We've noticed scheduling latency spikes when cgroups are in use, even when
the machine is fairly idle, with moderate scheduling frequency and a
single level of cgroup nesting.  More details are in the patch
descriptions, but here's a schbench run from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 26
	 75.0000th: 62
	 90.0000th: 74
	 95.0000th: 86
	 *99.0000th: 887
	 99.5000th: 3692
	 99.9000th: 10832
	 min=0, max=13374

And here's one from inside a first level cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 31
	 75.0000th: 65
	 90.0000th: 71
	 95.0000th: 91
	 *99.0000th: 7288
	 99.5000th: 10352
	 99.9000th: 12496
	 min=0, max=13023

The p99 latency spike was tracked down to runnable_load_avg not being
propagated through nested cfs_rqs, leaving load_balance() operating on
out-of-sync load information.  It ended up picking the wrong CPU as the
load balance target often enough to significantly impact p99 latency
(887us at the root vs. 7288us one level down in the runs above).

This patchset fixes the issue by always propagating runnable_load_avg
so that, regardless of nesting, every cfs_rq's runnable_load_avg is
the sum of the scaled loads of all tasks queued below it.

As a side effect, this changes the load_avg behavior of sched_entities
associated with cfs_rq's.  It doesn't seem wrong to me and I can't think
of a better / cleaner way, but if there is, please let me know.

This patchset is on top of v4.11-rc8 and contains the following two
patches.

 sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
 sched/fair: Always propagate runnable_load_avg

diffstat follows.

 kernel/sched/fair.c |   46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

Thanks.

-- 
tejun
