Message-Id: <20220217154403.6497-1-wuyun.abel@bytedance.com>
Date: Thu, 17 Feb 2022 23:43:56 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...e.de>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-kernel@...r.kernel.org
Subject: [RFC PATCH 0/5] introduce sched-idle balancing
Current load balancing is mainly based on cpu capacity
and task util, which makes sense from the POV of overall
throughput. But there is still room for improvement:
reducing the number of overloaded cfs rqs when sched-idle
or idle rqs exist.
A cfs runqueue is considered overloaded when there is
more than one pullable non-idle task on it (sched-idle
cpus are treated as idle cpus). Idle tasks are the ones
counted in rq->cfs.idle_h_nr_running, i.e. tasks that
are either assigned the SCHED_IDLE policy or placed
under idle cgroups.
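For illustration, the overload test could look roughly
like the sketch below. The helper name cfs_rq_overloaded()
is made up here; h_nr_running and idle_h_nr_running are
existing fields of struct cfs_rq.

    /* More than one pullable non-idle task on this rq? */
    static inline bool cfs_rq_overloaded(struct rq *rq)
    {
            return rq->cfs.h_nr_running -
                   rq->cfs.idle_h_nr_running > 1;
    }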
Overloaded cfs rqs can cause performance issues for both
task types:
  - latency critical tasks like SCHED_NORMAL ones will
    wait longer on the rq, resulting in higher p99
    latency, and
  - batch tasks may not be able to make full use of cpu
    capacity when sched-idle rqs exist, resulting in
    poorer throughput.
In short, the goal of sched-idle balancing is to let the
*non-idle tasks* make full use of cpu resources. To
achieve that, we mainly do two things (sketched below):
  - pull non-idle tasks to sched-idle or idle rqs from
    the overloaded ones, and
  - prevent pulling the last non-idle task off an rq.
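The following is a minimal sketch of such a balancing
pass, assuming a per-LLC mask of overloaded cpus.
sched_idle_rq() already exists in fair.c; overloaded_mask()
and pull_one_non_idle_task() are made-up helpers for
illustration only, not the actual patch.

    static void sched_idle_balance(struct rq *dst_rq)
    {
            int cpu;

            /* Only a sched-idle or fully idle rq may pull. */
            if (dst_rq->nr_running && !sched_idle_rq(dst_rq))
                    return;

            for_each_cpu(cpu, overloaded_mask(cpu_of(dst_rq))) {
                    struct rq *src_rq = cpu_rq(cpu);

                    /* Never pull the last non-idle task. */
                    if (src_rq->cfs.h_nr_running -
                        src_rq->cfs.idle_h_nr_running < 2)
                            continue;

                    if (pull_one_non_idle_task(src_rq, dst_rq))
                            break;
            }
    }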
The mask of overloaded cpus is updated in the periodic
tick and in the idle path, on a per-LLC-domain basis.
This cpumask is also used as a filter in SIS
(select_idle_sibling) to improve the idle cpu search.
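As an illustration, the filtering step in a
select_idle_cpu()-style scan could be a single extra
cpumask operation. sd_overloaded_mask() is a made-up
accessor for the per-LLC mask; the first line mirrors
the existing code.

    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
    /* Skip cpus that have more than one non-idle task. */
    cpumask_andnot(cpus, cpus, sd_overloaded_mask(sd));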
Tests were done on an Intel Xeon E5-2650 v4 server with
2 NUMA nodes, each of which has 12 cores, and with SMT2
enabled, so 48 CPUs in total. Test results are listed
as follows.
- we used the perf messaging benchmark to measure
  throughput at different loads (groups):

    perf bench sched messaging -g [N] -l 40000
    N        w/o       w/       diff
    1        2.897     2.834    -2.17%
    3        5.156     4.904    -4.89%
    5        7.850     7.617    -2.97%
    10      15.140    14.574    -3.74%
    20      29.387    27.602    -6.07%
  the results show an approximate 2~6% improvement.
- and schbench to measure latency in two scenarios:
  quiet and noisy. The quiet test runs schbench in a
  normal cpu cgroup on an otherwise quiet system, while
  the noisy test additionally runs the perf messaging
  workload inside an idle cgroup as noise.

    schbench -m 2 -t 24 -i 60 -r 60
    perf bench sched messaging -g 1 -l 4000000
    [quiet] latency percentiles (usec)
                 w/o       w/
    50.0th        31       31
    75.0th        45       45
    90.0th        55       55
    95.0th        62       61
    *99.0th       85       86
    99.5th       565      318
    99.9th     11536    10992
    max        13029    13067
    [noisy] latency percentiles (usec)
                 w/o       w/
    50.0th        34       32
    75.0th        48       45
    90.0th        58       55
    95.0th        65       61
    *99.0th     2364      208
    99.5th      6696     2068
    99.9th     12688     8816
    max        15209    14191
  it can be seen that the quiet test results are quite
  similar, but the p99 latency is greatly improved in
  the noisy test.
Comments and tests are appreciated!
Abel Wu (5):
sched/fair: record overloaded cpus
sched/fair: introduce sched-idle balance
sched/fair: add stats for sched-idle balancing
sched/fair: filter out overloaded cpus in sis
sched/fair: favor cpu capacity for idle tasks
include/linux/sched/idle.h | 1 +
include/linux/sched/topology.h | 15 ++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 187 ++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 4 +-
7 files changed, 215 insertions(+), 4 deletions(-)
--
2.11.0