Message-ID: <49E8167D.8000005@cn.fujitsu.com>
Date: Fri, 17 Apr 2009 13:41:17 +0800
From: Miao Xie <miaox@...fujitsu.com>
To: Ingo Molnar <mingo@...e.hu>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
CC: Linux-Kernel <linux-kernel@...r.kernel.org>
Subject: [RFC][PATCH] sched: fix the nice-unfairness on SMP when offlining a CPU

I tested the fairness of the scheduler on my multi-core box (2 CPUs * 2 cores) and
found that nice-fairness was broken after I offlined a CPU: half of the tasks got
only half as much CPU time as the others.

A test program which reproduces the problem on the current kernel is attached.
The program forks a number of child tasks; the parent then reads every child's
loop count and reports the average and standard deviation every 5 seconds.
(All of the child tasks do the same work: repeatedly computing sqrt.)

Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu3/online
# ./sched-fair -p 8 -i 5 -v
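For reference, here is a minimal sketch of what such a test might look like.
This is not the attached sched-fair.c; the NTASKS and INTERVAL constants stand
in for the -p and -i options, and a shared anonymous mapping holds the
per-child counters:

/* fairness-test sketch; build with: gcc -O2 -o fair fair.c -lm */
#include <math.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTASKS		8	/* like -p 8 */
#define INTERVAL	5	/* like -i 5 */

int main(void)
{
	/* per-child loop counters, shared with the parent */
	volatile unsigned long *count = mmap(NULL, NTASKS * sizeof(*count),
			PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	unsigned long last[NTASKS];
	int i;

	memset(last, 0, sizeof(last));

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			/* child: repeat doing sqrt, counting each loop */
			double x = 0.0;
			for (;;) {
				x += sqrt((double)count[i]);
				count[i]++;
			}
		}
	}

	for (;;) {	/* parent: report average and std-dev */
		double sum = 0.0, avg, var = 0.0;
		unsigned long delta[NTASKS];

		sleep(INTERVAL);
		for (i = 0; i < NTASKS; i++) {
			delta[i] = count[i] - last[i];
			last[i] = count[i];
			sum += delta[i];
		}
		avg = sum / NTASKS;
		for (i = 0; i < NTASKS; i++)
			var += ((double)delta[i] - avg) *
			       ((double)delta[i] - avg);
		printf("AVERAGE %.3f  STD-DEV %.3f\n",
		       avg, sqrt(var / NTASKS));
	}
}
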
By debugging, I found that the problem is caused by the __cpu_power of the sched
groups. After a CPU is offlined, the sched groups in the CPU-level sched domain
are partitioned as:

	+-----------+----------+
	| CPU0 CPU1 | CPU2     |
	+-----------+----------+

and the __cpu_power of each sched group was 1024. That is wrong: the first sched
group contains two logical CPUs, so its __cpu_power should be twice that of the
second group. Because both groups reported 1024, the load balancer spread the
load fifty-fifty between the two groups, so half of the test tasks were moved to
logical CPU2 and got less CPU time.
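To make the arithmetic concrete, here is a toy user-space model (not kernel
code; the 8-task workload and the assumption that the balancer spreads load
strictly in proportion to group power are simplifications for illustration):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024
#define NR_TASKS		8

/* group0 = {CPU0, CPU1}, group1 = {CPU2} */
static void balance(const char *tag, const double power[2])
{
	const double cpus[2] = { 2.0, 1.0 };
	double total = power[0] + power[1];
	int g;

	printf("%s:\n", tag);
	for (g = 0; g < 2; g++) {
		/* load is spread in proportion to __cpu_power */
		double tasks = NR_TASKS * power[g] / total;
		printf("\tgroup%d: %5.2f tasks -> %.3f CPU per task\n",
		       g, tasks, cpus[g] / tasks);
	}
}

int main(void)
{
	double broken[2] = { SCHED_LOAD_SCALE, SCHED_LOAD_SCALE };
	double fixed[2]  = { 2 * SCHED_LOAD_SCALE, SCHED_LOAD_SCALE };

	balance("equal __cpu_power (before the patch)", broken);
	balance("proportional __cpu_power (after the patch)", fixed);
	return 0;
}

With equal powers each group receives 4 tasks, so a task in the {CPU2} group
gets 0.25 of a CPU while the others get 0.5; with 2048 vs 1024 the split
becomes 5.33 vs 2.67 tasks and every task gets 0.375 of a CPU.
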
The code that causes the problem is the following:
static void init_sched_groups_power(int cpu, struct sched_domain *sd)
{
	[snip]
	/*
	 * For perf policy, if the groups in child domain share resources
	 * (for example cores sharing some portions of the cache hierarchy
	 * or SMT), then set this domain groups cpu_power such that each group
	 * can handle only one task, when there are other idle groups in the
	 * same sched domain.
	 */
	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
		       (child->flags &
			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
		return;
	}
	[snip]
}
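For context, the fall-through path elided above as [snip] accumulates each
child group's __cpu_power into this group's power. Quoting from memory from
kernel/sched.c of the same era (so treat the exact lines as approximate), it
looks roughly like the block below; once the short-circuit no longer fires,
this is what gives the {CPU0, CPU1} group 2 * SCHED_LOAD_SCALE = 2048:

	/*
	 * Add cpu_power of each child group to this group's cpu_power.
	 * (group is a struct sched_group * declared earlier in the function.)
	 */
	group = child->groups;
	do {
		sg_inc_cpu_power(sd->groups, group->__cpu_power);
		group = group->next;
	} while (group != child->groups);
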
According to the comment being removed, this design was motivated by
performance. But I found no regression after applying this patch.
Test result on a multi-core x86_64 box:

Before applying this patch:
	AVERAGE		STD-DEV
	1297.500	432.518

After applying this patch:
	AVERAGE		STD-DEV
	1297.250	118.857

Test result on a hyper-threading x86_64 box:

Before applying this patch:
	AVERAGE		STD-DEV
	536.750		176.265

After applying this patch:
	AVERAGE		STD-DEV
	535.625		53.979
Maybe we need more testing of it.

Signed-off-by: Miao Xie <miaox@...fujitsu.com>
---
kernel/sched.c | 11 +----------
1 files changed, 1 insertions(+), 10 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 5724508..07b08b2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7956,16 +7956,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 
 	sd->groups->__cpu_power = 0;
 
-	/*
-	 * For perf policy, if the groups in child domain share resources
-	 * (for example cores sharing some portions of the cache hierarchy
-	 * or SMT), then set this domain groups cpu_power such that each group
-	 * can handle only one task, when there are other idle groups in the
-	 * same sched domain.
-	 */
-	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
-		       (child->flags &
-			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
+	if (!child) {
 		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
 		return;
 	}
--
1.6.0.3