Message-Id: <45222b6f-4849-f1f4-fdf5-2a26ac9a3ed4@de.ibm.com>
Date:   Mon, 26 Sep 2016 12:42:22 +0200
From:   Christian Borntraeger <borntraeger@...ibm.com>
To:     Yuyang Du <yuyang.du@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair:
 Rewrite runnable load and utilization average tracking)

Folks,

I have seen big scalability degradations since 4.3 (bisected to 9d89c257d
"sched/fair: Rewrite runnable load and utilization average tracking").
This has not been fixed by subsequent patches, e.g. the ones that try to
fix this for interactive workloads.

The problem is only visible for sleep/wakeup-heavy workloads that run
inside a scheduler cgroup (e.g. a sysbench OLTP run inside a KVM guest,
as libvirt will put KVM guests into cgroup instances).
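For what it's worth, that cgroup placement is easy to confirm by dumping
/proc/<pid>/cgroup for the guest's QEMU process. A throwaway sketch (the
exact group path libvirt creates varies per setup, so treat the output
interpretation as an assumption of mine):

#include <stdio.h>

/* Print the cgroup membership of a given PID, e.g. the qemu-kvm
 * process of a guest. With group scheduling active, libvirt puts
 * the guest into its own cpu,cpuacct cgroup. */
int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* look for the cpu,cpuacct line */
	fclose(f);
	return 0;
}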

For example, a simple sysbench OLTP run with mysql inside a KVM guest
with 16 CPUs backed by 8 host cpus (16 host threads) scales worse when
scaling up with multiple instances. The numbers below are events per
second. Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to
not use group scheduling for KVM guests) makes the behaviour much
better; a minimal sketch of that unmount step follows the table:


instances	group		nogroup
1		3406		3002
2		5078		4940
3		6017		6760
4		6471		8216 (+27%)
5		6716		9196
6		6976		9783
7		7127		10170
8		7399		10385 (+40%)
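The unmount workaround itself is nothing more than detaching the
cpu,cpuacct hierarchy before the guests are started. As a minimal
sketch, equivalent to "umount /sys/fs/cgroup/cpu,cpuacct" from a root
shell (that libvirtd needs a restart afterwards so it stops recreating
the per-guest groups is my assumption):

#include <stdio.h>
#include <sys/mount.h>

/* Detach the cpu,cpuacct cgroup hierarchy so that libvirt cannot
 * place KVM guests into scheduler groups. Must run as root. */
int main(void)
{
	if (umount("/sys/fs/cgroup/cpu,cpuacct")) {
		perror("umount /sys/fs/cgroup/cpu,cpuacct");
		return 1;
	}
	return 0;
}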

Before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
average tracking") there was basically no difference between group
and non-group scheduling. These numbers are with 4.7; older kernels
after 9d89c257d show a similar difference.

The bad thing is that there is a lot of idle cpu power in the host
while this happens, so the scheduler apparently does not realize that
this workload could use more cpus in the host.

I tried some experiments, but I have not found a hack that "fixes" the
degradation, which would have given me an indication of which part of
the code is broken. So, are there any ideas? Is the estimated group
load calculation just not fast enough for sleep/wakeup workloads?
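To make that suspicion a bit more concrete: the rewrite bases the task
and cfs_rq load averages on PELT's geometric series, where contributions
are accumulated in ~1ms periods and decayed so that they halve every 32
periods. A back-of-the-envelope model (plain floating point, my own
simplification of the kernel's fixed-point code, not kernel code) shows
how far the tracked average of a task with a 25% duty cycle stays below
its demand during each running burst:

#include <stdio.h>
#include <math.h>

#define NPERIODS	256	/* simulated ~1ms PELT periods */
#define HALFLIFE	32	/* periods until a contribution halves */

/* Exponential-moving-average model of PELT load tracking: a task
 * that runs one period and sleeps three (25% duty cycle) settles
 * around load_avg ~256/1024, even though it wants a full CPU during
 * every burst. Build with -lm. */
int main(void)
{
	double y = pow(0.5, 1.0 / HALFLIFE);	/* ~0.97857 */
	double avg = 0.0;
	int p;

	for (p = 0; p < NPERIODS; p++) {
		double contrib = (p % 4 == 0) ? 1024.0 : 0.0;

		avg = avg * y + contrib * (1.0 - y);
		if (p % 32 == 31)
			printf("period %3d: load_avg ~ %6.1f / 1024\n",
			       p, avg);
	}
	return 0;
}

If the group's share of cpu is derived from an average that lags the
bursts like this, that would at least be consistent with the idle host
cpus I am seeing, but I cannot tell from the outside whether this is
really the broken part.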

Christian
