[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <39cc4666-6355-fb9f-654d-e85e1852bc6f@linux.ibm.com>
Date: Thu, 9 Jul 2020 17:37:25 +0530
From: Parth Shah <parth@...ux.ibm.com>
To: Patrick Bellasi <patrick.bellasi@...bug.net>,
LKML <linux-kernel@...r.kernel.org>
Cc: Ingo Molnar <mingo@...hat.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Jonathan Corbet <corbet@....net>,
Dhaval Giani <dhaval.giani@...cle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Josef Bacik <jbacik@...com>,
Chris Hyser <chris.hyser@...cle.com>
Subject: [SchedulerTaskPacking] Small background task packing
> A) Name:
Small background task packing
> B) Target behaviour:
All fair task wakeup follows a procedure of finding an idle CPU and
waking the task on this idle CPU. There are two major wakeup paths:
1. Slow-path: Wake up the task on an idle CPU which is in the shallowest
idle states by searching in the highest sched_domain flagged with
SD_BALANCE_FORK or SD_BALANCE_EXEC.
2. Fast-path: Wake up the task on an idle CPU in the LLC sched_domain of
the waker CPU. There are optimization to bias task placement on prev_cpu or
wake_cpu of the task. This path is majorly used except in few cases like
during fork() and exec().
This assumption of picking an idle CPU is fair in-order to uniformly
consume system resources. But not all tasks deserves to wakeup an idle core
as it can hurt power consumption. For e.g. like small background tasks
waking up an idle core and runs only for very small duration. Reducing
number of active cores allows busier cores to boost frequency and hence
saving power can also result in performance benefits.
There is no mechanism in existing wake up path to detect small
background tasks which can be packed on fewer cores.
> C) Existing control paths:
fair:: .select_task_rq = select_task_rq_fair
fair::select_task_rq_fair()
// 1) Slow-path: find idle CPU with shallowest idle states
find_idlest_cpu()
// 2) Fast-path: find idle CPU
fair:select_idle_sibling()
// Wake up on idle core if available
fair:select_idle_core()
// Else wakeup on idle CPU if available
fair:select_idle_cpu()
fair:select_idle_smt()
There are multiple ways to call fair:select_task_rq_fair();
// 1) try_to_wake_up which should lead to fast-path
core::try_to_wake_up()
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
// 2) wake_up_new_task which should lead to slow-path
core::wake_up_new_task()
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK,0));
// 3) sched_exec which should lead to slow-path
core::sched_exec()
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
SD_BALANCE_EXEC, 0);
> D) Desired behaviour:
Latency tolerant tasks with small utilization should be packed
on a busy core rather than waking up a new core/CPU.
Upon detecting small-background tasks, different code-path can be used to
search for a busy core as described below;
sched/fair.c:
static inline bool is_small_bg_task(struct task_struct *p)
{
if (is_latency_tolerant(p) &&
(task_util(p) > (SCHED_CAPACITY_SCALE >> 3)))
return true;
return false;
}
sched/fair.c: in select_task_rq_fair()
if (sd_flag & SD_BALANCE_WAKE) {
if (is_small_bg_task(p)) {
// New proposed wakeup path to do task packing
new_cpu = select_non_idle_core(p, prev_cpu);
if (new_cpu >= 0)
return new_cpu;
}
}
where select_non_idle_core() searches for a busy core already running some
tasks and selects an idle CPU in that busy core to pack the waking task.
Complete code can be found on TurboSched patches [1].
> E) Existing knobs (if any): reference to whatever existing tunable
There are no existing knob which can hint the scheduler about the latency
nature of task. Detecting latency nature of the task can help in
classifying task as small background tasks to be packed on fewer number of
cores.
There are user-space hacks to do task packing for background tasks with the
use of cpuset.cpus or sched_setaffinity() to manually affine the process to
fewer dedicated cores.
> F) Existing knobs (if any): one paragraph description of the limitations
Sysadmin/user has to define cpumask for each process (aka task pinning)
which is static in nature. There are multiple limitations to pin the small
background tasks;
- In presence of just one small background task, there is no power
consumption benefit here. It would be preferable to pin it to busy CPU.
- If a task changes the behavior in its life-cycle then sysadmin will have
to manually pin/unpin such tasks. This is limitation of user to classify
tasks as only "background "one and cannot say if and when it will be
"small" in utilization.
- Sysadmin cannot change the task's affinity mask based on the scheduling
pattern to give most optimal performance and energy consumption.
> G) Proportionality Analysis: check the nature of the target behavior
Task packing has to be noninvasive in nature, meaning only the tasks which
are latency tolerant should be packed on fewer cores. Providing this
behavior needs a run-time tunable knob which can hint the scheduler on
whether the waking task can be packed or not.
Upon detecting the nature of the task, a specific wakeup path can be followed:
1. On latency-tolerant tasks with low utilization, a new proposed
scheduling wakeup path will be followed to do packing
2. On latency-sensitive task, an exiting approach of wakeup can be used.
> H) Range Analysis: identify meaningful ranges
The new knob can be binary input accepting 0/1, where 0 means
latency-sensitive and 1 means latency-tolerant task.
Latency-sensitive tasks (value = 0) can search idle CPU in only the llc
sched_domain while the latency-tolerance (value = 1) tasks can go across
llc sched_domain (just like in slow-path) but here in-order to search for a
busy core.
Mapping analysis:
================
The latency_nice proposal [2] offers a [-20, 19] range which can be
mapped into a binary switch, e.g. using a threshold based function.
However, it is possible to extend the concept of finding busy core by
limiting the time spent on searching based on the latency_nice value from
range [-20, 19] where value of 19 indicates searching in the whole chip for
a busy core, whereas value of 10 could mean search for half of the cores in
the chip.
> I) System-Wide tuning: which knobs are required
The latency_nice provided knobs should be enough to get the desired
effect.
> J) Per-Task tuning: which knobs are required
sched_setscheduler() is sufficient.
> K) Task-Group tuning: which knobs are required
A single attribute classifying task-group as latency_nice or
latency_tolerant is sufficient.
> .:: References
> ==============
[1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
for longer durations
https://lkml.org/lkml/2020/1/21/39
[2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
Message-ID: 20200228090755.22829-1-parth@...ux.ibm.com
https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
[3] TurboSched: the return of small-task packing
https://lwn.net/Articles/792471/
https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4
Powered by blists - more mailing lists