Date:   Tue, 7 Feb 2017 00:37:51 +0100
From:   Jirka Hladky <jhladky@...hat.com>
To:     linux-kernel <linux-kernel@...r.kernel.org>
Subject: Group Imbalance bug - performance drop of up to 10x

Hello,

We observe that the group imbalance bug can cause a performance
degradation of up to 10x on a 4-NUMA-node server.

I have opened Bug 194231
https://bugzilla.kernel.org/show_bug.cgi?id=194231
for this issue.

The problem was first described in section 3.1 of this paper:

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

The scheduler does not balance the load correctly on a 4-NUMA-node
server in the following scenario:
 * there are three independent ssh connections
 * the first two ssh connections run a single-threaded CPU-intensive workload
 * the last ssh session runs a multi-threaded application which
requires almost all cores in the system.

We have used
* stress --cpu 1 as the single-threaded CPU-intensive workload
http://people.seas.harvard.edu/~apw/stress/
and
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the
multi-threaded application (see the invocation sketch just below)
https://www.nas.nasa.gov/publications/npb.html
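For reference, a minimal sketch of the invocations. This assumes the
OpenMP build of NPB (e.g. built with "make lu CLASS=C") and a 64-CPU
machine, so 64 - 4 = 60 threads; adjust the thread count and binary
path for your system:

# ssh session 1 and ssh session 2, one instance each
stress --cpu 1

# ssh session 3
OMP_NUM_THREADS=60 ./bin/lu.C.x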

Version-Release number of selected component (if applicable):
Reproduced on

kernel 4.10.0-0.rc6


How reproducible:

It requires a server with at least 2 NUMA nodes. The problem gets worse
on a 4-NUMA-node server.


Steps to Reproduce:
1. start 3 ssh connections to server
2. in first two ssh connections run stress --cpu 1
3. in the third ssh connection run the lu.C.x benchmark with the number
of threads equal to the number of CPUs in the system minus 4.
4. run either Intel's numatop
echo "N" | numatop -d log >/dev/null 2>&1 &
or mpstat -P ALL 5 and check the load distribution across the NUMA
nodes. The mpstat output can be processed by the mpstat2node.py utility
to aggregate the data per NUMA node:
https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py

mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)

5. Compare the results against the same workload started from ONE ssh
session (all processes in one group); a sketch of this baseline follows below.
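A minimal sketch of the single-session baseline for step 5, with
everything started from one shell (and therefore in one task group);
again the thread count of 60 assumes a 64-CPU machine:

export OMP_NUM_THREADS=60
stress --cpu 1 &
stress --cpu 1 &
./bin/lu.C.x          # NPB prints its own "Time in seconds" summary
kill %1 %2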


Actual results:

Uneven load across NUMA nodes:
Average:    NODE    %usr     %idle
Average:     all   66.12      33.51
Average:       0   37.97      61.74
Average:       1   31.67      68.15
Average:       2   97.50       1.98
Average:       3   97.33       2.19

Please note that while the number of CPU-intensive threads is 62 on this
64-CPU system (60 lu.C.x threads plus 2 stress threads), NUMA nodes #0
and #1 are underutilized.

The real runtime of the lu.C.x benchmark went up from 114 seconds
to 846 seconds!

Expected results:

Load evenly balanced across all NUMA nodes. The real runtime of the
lu.C.x benchmark should be the same regardless of whether the jobs were
started from one ssh session or from multiple ssh sessions.

Additional info:

See
https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch
for a proposed patch against kernel 4.1.

I will upload a reproducer to the bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=194231

Thanks a lot!
Jirka
