linux-kernel - Re: [PATCH 0/6] [RFC] Large weight differential leads to inefficient load balancing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <AANLkTimMyUd9BxXO5JxgpAiAKCJAS1xZiTXTgp+GKSU0@mail.gmail.com>
Date:	Tue, 3 Aug 2010 14:28:37 -0700
From:	Nikhil Rao <ncrao@...gle.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Ingo Molnar <mingo@...e.hu>, Mike Galbraith <efault@....de>,
	linux-kernel@...r.kernel.org,
	Venkatesh Pallipadi <venki@...gle.com>,
	Ken Chen <kenchen@...gle.com>, Paul Turner <pjt@...gle.com>
Subject: Re: [PATCH 0/6] [RFC] Large weight differential leads to inefficient 
	load balancing

Peter,

Thanks for the feedback!

On Mon, Aug 2, 2010 at 4:39 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> The thing is that from a fairness point of view 1 nice-0 (weight=1024)
> on one CPU and 512 SCHED_IDLE (weight=2) tasks on another CPU and all
> other CPUs idle is correct.
>
> It just doesn't seem to be the thing that most people expect.
>
> Special casing things like you've done is utterly the wrong thing to do.
>

I see your point here, and yes I agree having 1 nice-0 on one cpu, 512
SCHED_IDLE tasks on another cpu and all other cpus idle is correct if
we only considered fairness. However, we would also like to maximize
machine utilization. The fitness function we would ideally like to
optimize for is a combination of both fairness and utilization.

I'm going to take a step back here and explain the bigger problem we
are trying to solve. We have two types of workloads -- high priority
tasks that need to run for short periods of time and can consume as
much cpu as required when they run; and low priority tasks that
ideally consume slack cpu. Let's say we had a machine with enough
tasks to soak up all the cpus. There are two cases to implement the
priority scheme.

The first case is to run all tasks run as SCHED_NORMAL. The machine is
fully utilized but this has other side-effects. Low priority tasks
consume much more cpu than we would like, they get to preempt high
priority tasks, and this results in poor performance of high priority
tasks.

The second case is to run high priority tasks as SCHED_NORMAL and low
priority tasks as SCHED_IDLE. We choose SCHED_IDLE to minimize
interference with high priority tasks -- we don't want low priority
tasks to preempt high priority or to be favored over high priority
tasks in any way. When there are high priority tasks, we want low
priority tasks to get as little cpu as possible; thus giving them the
minimum possible weight works out well. In this case, we are able to
isolate high prio tasks from low prio tasks reasonably well. However,
sub-optimal machine utilization defeats the purpose of packing a
machine with lots of low priority tasks as we are not able to consume
the slack cpu.

This RFC is really meant to explain the problem that we are facing. We
presented one possible solution to fix this but we are also open to
other suggestions and ideas.

> This problem comes in two forms and its name is infeasible weight
> distribution. The load-balancer tries to ensure W_k ~= W_l, k,l elem_of
> CPUs, where W_k = \Sum_i w_i^k, where w_i^k is the i-th task on CPU k.
>
> The two cases are statically infeasible, and dynamically infeasible.
>
> We say the task-set is statically infeasible if for a task set of n
> tasks there is no way to statically distribute them on N <= n CPUs such
> that each task gets equal service (assuming the scheduling on each CPU
> is fair).
>
> We say the task-set is dynamically infeasible if for the given scenario
> there is no way to rotate the tasks to obtain equality.
>
> Lets assume 2 CPUs.
>
> Ex.1: 2 tasks of different weight.
>
> Ex.2: 3 tasks of equal weight.
>
> The first example is both statically and dynamically infeasible as there
> is no way to occupy both CPUs such that each task gets the proportional
> correct service.
>
> The second example is statically infeasible, but dynamically feasible,
> for if we rotate one task, such that we alternate between 2:1 and 1:2 in
> equal measures, each task will receive its correct 2/3rd CPU service.
>
> The current load-balancer isn't particularly skilled at either issue.
>
> The proper solution is to 'fix' find_busiest_group() so that it will:
>  - pick the heaviest cpu with more than 1 task on it
>  - slowly over-balance things
>
> The first thing will solve your issue.
>

Thanks for your suggestions; I explored the first one a bit and I
added a check into find_busiest_queue() (instead of
find_busiest_group()) to skip a cpu if it has only 1 task on it (patch
attached below - did you have something else in mind?). This fixes the
example I posted in the RFC, but it doesn't work as well when the
SCHED_NORMAL tasks have a sleep/wakeup pattern. I have some data below
where the load balancer fails to fully utilize a machine. In these
examples, I ran with the upstream kernel and with a kernel compiled
with the check in fbq().

Setup: We run 16 SCHED_IDLE soakers and about half as many
SCHED_NORMAL tasks which have 100ms / 100ms sleep/busy cycles. This is
actually a very common use case that we run into.

2.6.35-rc6

12:41:45 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft
%steal   %idle    intr/s
12:41:46 PM  all   97.50    0.00    0.00    0.00    0.00    0.00
0.00    2.50  16405.00
12:41:47 PM  all   89.72    0.00    0.06    0.00    0.00    0.00
0.00   10.22  15736.00
12:41:48 PM  all   90.54    0.06    0.06    0.00    0.00    0.00
0.00    9.34  15791.09
12:41:49 PM  all   93.00    0.00    0.06    0.00    0.00    0.00
0.00    6.93  15816.83
12:41:50 PM  all   96.44    0.00    0.06    0.00    0.00    0.06
0.00    3.44  16362.00
12:41:51 PM  all   97.62    0.00    0.06    0.00    0.00    0.00
0.00    2.31  16326.00
12:41:52 PM  all   99.56    0.00    0.06    0.00    0.00    0.00
0.00    0.38  16512.12
12:41:53 PM  all   99.06    0.00    0.06    0.00    0.00    0.00
0.00    0.88  16289.00
12:41:54 PM  all   99.50    0.00    0.06    0.00    0.00    0.00
0.00    0.44  16149.50
12:41:55 PM  all   98.06    0.00    0.06    0.00    0.00    0.00
0.00    1.88  16405.05
Average:     all   96.10    0.01    0.06    0.00    0.00    0.01
0.00    3.83  16177.92

2.6.35-rc6 + fbg-fix

12:42:48 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft
%steal   %idle    intr/s
12:42:49 PM  all   98.07    0.00    0.06    0.00    0.00    0.00
0.00    1.87  16346.00
12:42:50 PM  all   98.75    0.00    0.12    0.00    0.00    0.00
0.00    1.12  16236.63
12:42:51 PM  all   99.56    0.06    0.19    0.00    0.00    0.00
0.00    0.19  16616.16
12:42:52 PM  all   97.94    0.00    0.06    0.00    0.00    0.06
0.00    1.94  16348.00
12:42:53 PM  all   96.94    0.00    0.25    0.00    0.00    0.00
0.00    2.81  16234.65
12:42:54 PM  all   98.56    0.06    0.06    0.00    0.00    0.00
0.00    1.31  16339.00
12:42:55 PM  all   97.56    0.00    0.19    0.00    0.00    0.00
0.00    2.25  16570.71
12:42:56 PM  all   99.50    0.00    0.06    0.00    0.00    0.00
0.00    0.44  16400.00
12:42:57 PM  all   95.25    0.00    0.37    0.00    0.00    0.00
0.00    4.37  16367.00
12:42:58 PM  all   97.75    0.00    0.06    0.00    0.00    0.00
0.00    2.19  16409.00
Average:     all   97.99    0.01    0.14    0.00    0.00    0.01
0.00    1.85  16386.00

-Thanks,
Nikhil

---
fqb-fix patch:

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a878b53..e05c61f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2742,13 +2742,8 @@ find_busiest_queue(struct sched_group *group,
enum cpu_idle_type idle,
                        continue;

                rq = cpu_rq(i);
-               wl = weighted_cpuload(i);

-               /*
-                * When comparing with imbalance, use weighted_cpuload()
-                * which is not scaled with the cpu power.
-                */
-               if (capacity && rq->nr_running == 1 && wl > imbalance)
+               if (capacity && rq->nr_running == 1)
                        continue;

                /*
@@ -2757,6 +2752,7 @@ find_busiest_queue(struct sched_group *group,
enum cpu_idle_type idle,
                 * the load can be moved away from the cpu that is potentially
                 * running at a lower capacity.
                 */
+               wl = weighted_cpuload(i);
                wl = (wl * SCHED_LOAD_SCALE) / power;

                if (wl > max_load) {
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/