lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 14 Jul 2020 08:59:41 -0400
From:   peter.puhov@...aro.org
To:     linux-kernel@...r.kernel.org
Cc:     peter.puhov@...aro.org, robert.foley@...aro.org,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>
Subject: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

From: Peter Puhov <peter.puhov@...aro.org>

v0: https://lkml.org/lkml/2020/6/16/1286

Changes in v1:
- Test results formatted in a table form as suggested by Valentin Schneider
- Added explanation by Vincent Guittot why nr_running may not be sufficient

In slow path, when selecting idlest group, if both groups have type 
group_has_spare, only idle_cpus count gets compared.
As a result, if multiple tasks are created in a tight loop,
and go back to sleep immediately 
(while waiting for all tasks to be created), 
they may be scheduled on the same core, because CPU is back to idle
when the new fork happen.

For example:
sudo perf record -e sched:sched_wakeup_new -- \
                                  sysbench threads --threads=4 run
...
    total number of events:              61582
...
sudo perf script
sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new: 
                            sysbench:129380 [120] success=1 CPU:007
sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new: 
                            sysbench:129381 [120] success=1 CPU:007
sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new: 
                            sysbench:129382 [120] success=1 CPU:007
sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new: 
                            sysbench:129383 [120] success=1 CPU:007

This may have negative impact on performance for workloads with frequent
creation of multiple threads.

In this patch we are using group_util to select idlest group if both groups
have equal number of idle_cpus. Comparing the number of idle cpu is 
not enough in this case, because the newly forked thread sleeps 
immediately and before we select the cpu for the next one. 
This is shown in the trace where the same CPU7 is selected for 
all wakeup_new events.
That's why, looking at utilization when there is the same number of
CPU is a good way to see where the previous task was placed. Using
nr_running doesn't solve the problem because the newly forked task is not
running and the cpu would not have been idle in this case and an idle
CPU would have been selected instead.

With this patch newly created tasks would be better distributed. 

With this patch:
sudo perf record -e sched:sched_wakeup_new -- \
                                    sysbench threads --threads=4 run
...
    total number of events:              74401
...
sudo perf script
sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new: 
                            sysbench:129457 [120] success=1 CPU:008
sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new: 
                            sysbench:129458 [120] success=1 CPU:009
sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new: 
                            sysbench:129459 [120] success=1 CPU:010
sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new: 
                            sysbench:129460 [120] success=1 CPU:011


We tested this patch with following benchmarks:
master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")' 

100 iterations of: perf bench -f simple futex wake -s -t 128 -w 1
Lower result is better
|         |   BASELINE |   +PATCH |   DELTA (%) |
|---------|------------|----------|-------------|
| mean    |      0.33  |    0.313 |      +5.152 |
| std (%) |     10.433 |    7.563 |             |


100 iterations of: sysbench threads --threads=8 run
Higher result is better
|         |   BASELINE |   +PATCH |   DELTA (%) |
|---------|------------|----------|-------------|
| mean    |   5235.02  | 5863.73  |      +12.01 |
| std (%) |      8.166 |   10.265 |             |


100 iterations of: sysbench mutex --mutex-num=1 --threads=8 run
Lower result is better
|         |   BASELINE |   +PATCH |   DELTA (%) |
|---------|------------|----------|-------------|
| mean    |      0.413 |    0.404 |      +2.179 |
| std (%) |      3.791 |    1.816 |             |


Signed-off-by: Peter Puhov <peter.puhov@...aro.org>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..abcbdf80ee75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
 
 	case group_has_spare:
 		/* Select group with most idle CPUs */
-		if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
+		if (idlest_sgs->idle_cpus > sgs->idle_cpus)
 			return false;
+
+		/* Select group with lowest group_util */
+		if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
+			idlest_sgs->group_util <= sgs->group_util)
+			return false;
+
 		break;
 	}
 
-- 
2.20.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ