[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtAxcdQ=bNapyvsrG_Cb6p8h2ut4=waiSPmay+tNxF8fZQ@mail.gmail.com>
Date: Mon, 20 Jul 2020 17:20:44 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Parth Shah <parth@...ux.ibm.com>
Cc: Patrick Bellasi <patrick.bellasi@...bug.net>,
LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Jonathan Corbet <corbet@....net>,
Dhaval Giani <dhaval.giani@...cle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Josef Bacik <jbacik@...com>,
Chris Hyser <chris.hyser@...cle.com>
Subject: Re: [SchedulerTaskPacking] Small background task packing
Hi Parth,
On Thu, 9 Jul 2020 at 14:09, Parth Shah <parth@...ux.ibm.com> wrote:
>
> > A) Name:
>
> Small background task packing
>
> > B) Target behaviour:
>
> All fair task wakeup follows a procedure of finding an idle CPU and
> waking the task on this idle CPU. There are two major wakeup paths:
> 1. Slow-path: Wake up the task on an idle CPU which is in the shallowest
> idle states by searching in the highest sched_domain flagged with
> SD_BALANCE_FORK or SD_BALANCE_EXEC.
> 2. Fast-path: Wake up the task on an idle CPU in the LLC sched_domain of
> the waker CPU. There are optimization to bias task placement on prev_cpu or
> wake_cpu of the task. This path is majorly used except in few cases like
> during fork() and exec().
>
> This assumption of picking an idle CPU is fair in-order to uniformly
> consume system resources. But not all tasks deserves to wakeup an idle core
> as it can hurt power consumption. For e.g. like small background tasks
> waking up an idle core and runs only for very small duration. Reducing
> number of active cores allows busier cores to boost frequency and hence
> saving power can also result in performance benefits.
>
> There is no mechanism in existing wake up path to detect small
> background tasks which can be packed on fewer cores.
>
> > C) Existing control paths:
>
> fair:: .select_task_rq = select_task_rq_fair
>
> fair::select_task_rq_fair()
> // 1) Slow-path: find idle CPU with shallowest idle states
> find_idlest_cpu()
>
> // 2) Fast-path: find idle CPU
> fair:select_idle_sibling()
> // Wake up on idle core if available
> fair:select_idle_core()
> // Else wakeup on idle CPU if available
> fair:select_idle_cpu()
> fair:select_idle_smt()
>
>
> There are multiple ways to call fair:select_task_rq_fair();
>
> // 1) try_to_wake_up which should lead to fast-path
> core::try_to_wake_up()
> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>
> // 2) wake_up_new_task which should lead to slow-path
> core::wake_up_new_task()
> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK,0));
>
> // 3) sched_exec which should lead to slow-path
> core::sched_exec()
> dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
> SD_BALANCE_EXEC, 0);
>
> > D) Desired behaviour:
>
> Latency tolerant tasks with small utilization should be packed
> on a busy core rather than waking up a new core/CPU.
>
> Upon detecting small-background tasks, different code-path can be used to
> search for a busy core as described below;
>
> sched/fair.c:
> static inline bool is_small_bg_task(struct task_struct *p)
> {
> if (is_latency_tolerant(p) &&
> (task_util(p) > (SCHED_CAPACITY_SCALE >> 3)))
This condition seems to be quite arbitrary and probably works on your
current platform but other platforms might want another threshold. Is
there a way to know how much utilization can be added to a CPU which
would not use more capacity than the extra capacity that is provided
by the turbo mode ?
Also you may want to use task_util_est() instead of task_util() to
make sure that the small background task has low chance to become a
large one and not use more capacity than the one provided by turbo at
the end
> return true;
>
> return false;
> }
>
> sched/fair.c: in select_task_rq_fair()
>
> if (sd_flag & SD_BALANCE_WAKE) {
> if (is_small_bg_task(p)) {
> // New proposed wakeup path to do task packing
> new_cpu = select_non_idle_core(p, prev_cpu);
> if (new_cpu >= 0)
> return new_cpu;
> }
> }
>
> where select_non_idle_core() searches for a busy core already running some
> tasks and selects an idle CPU in that busy core to pack the waking task.
>
> Complete code can be found on TurboSched patches [1].
>
> > E) Existing knobs (if any): reference to whatever existing tunable
>
> There are no existing knob which can hint the scheduler about the latency
> nature of task. Detecting latency nature of the task can help in
> classifying task as small background tasks to be packed on fewer number of
> cores.
>
> There are user-space hacks to do task packing for background tasks with the
> use of cpuset.cpus or sched_setaffinity() to manually affine the process to
> fewer dedicated cores.
>
> > F) Existing knobs (if any): one paragraph description of the limitations
>
> Sysadmin/user has to define cpumask for each process (aka task pinning)
> which is static in nature. There are multiple limitations to pin the small
> background tasks;
>
> - In presence of just one small background task, there is no power
> consumption benefit here. It would be preferable to pin it to busy CPU.
>
> - If a task changes the behavior in its life-cycle then sysadmin will have
> to manually pin/unpin such tasks. This is limitation of user to classify
> tasks as only "background "one and cannot say if and when it will be
> "small" in utilization.
>
> - Sysadmin cannot change the task's affinity mask based on the scheduling
> pattern to give most optimal performance and energy consumption.
>
> > G) Proportionality Analysis: check the nature of the target behavior
>
> Task packing has to be noninvasive in nature, meaning only the tasks which
> are latency tolerant should be packed on fewer cores. Providing this
> behavior needs a run-time tunable knob which can hint the scheduler on
> whether the waking task can be packed or not.
>
> Upon detecting the nature of the task, a specific wakeup path can be followed:
> 1. On latency-tolerant tasks with low utilization, a new proposed
> scheduling wakeup path will be followed to do packing
> 2. On latency-sensitive task, an exiting approach of wakeup can be used.
>
> > H) Range Analysis: identify meaningful ranges
>
> The new knob can be binary input accepting 0/1, where 0 means
> latency-sensitive and 1 means latency-tolerant task.
>
> Latency-sensitive tasks (value = 0) can search idle CPU in only the llc
> sched_domain while the latency-tolerance (value = 1) tasks can go across
> llc sched_domain (just like in slow-path) but here in-order to search for a
> busy core.
>
> Mapping analysis:
> ================
> The latency_nice proposal [2] offers a [-20, 19] range which can be
> mapped into a binary switch, e.g. using a threshold based function.
>
> However, it is possible to extend the concept of finding busy core by
> limiting the time spent on searching based on the latency_nice value from
> range [-20, 19] where value of 19 indicates searching in the whole chip for
> a busy core, whereas value of 10 could mean search for half of the cores in
> the chip.
>
> > I) System-Wide tuning: which knobs are required
>
> The latency_nice provided knobs should be enough to get the desired
> effect.
>
> > J) Per-Task tuning: which knobs are required
>
> sched_setscheduler() is sufficient.
>
> > K) Task-Group tuning: which knobs are required
>
> A single attribute classifying task-group as latency_nice or
> latency_tolerant is sufficient.
>
>
> > .:: References
> > ==============
>
> [1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
> for longer durations
> https://lkml.org/lkml/2020/1/21/39
> [2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
> Message-ID: 20200228090755.22829-1-parth@...ux.ibm.com
> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
> [3] TurboSched: the return of small-task packing
> https://lwn.net/Articles/792471/
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4
Powered by blists - more mailing lists