linux-kernel - [SchedulerTaskPacking] Small background task packing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <39cc4666-6355-fb9f-654d-e85e1852bc6f@linux.ibm.com>
Date:   Thu, 9 Jul 2020 17:37:25 +0530
From:   Parth Shah <parth@...ux.ibm.com>
To:     Patrick Bellasi <patrick.bellasi@...bug.net>,
        LKML <linux-kernel@...r.kernel.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Jonathan Corbet <corbet@....net>,
        Dhaval Giani <dhaval.giani@...cle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Josef Bacik <jbacik@...com>,
        Chris Hyser <chris.hyser@...cle.com>
Subject: [SchedulerTaskPacking] Small background task packing

> A) Name:

Small background task packing

> B) Target behaviour:

All fair task wakeup follows a procedure of finding an idle CPU and
waking the task on this idle CPU. There are two major wakeup paths:
1. Slow-path: Wake up the task on an idle CPU which is in the shallowest
idle states by searching in the highest sched_domain flagged with
SD_BALANCE_FORK or SD_BALANCE_EXEC.
2. Fast-path: Wake up the task on an idle CPU in the LLC sched_domain of
the waker CPU. There are optimization to bias task placement on prev_cpu or
wake_cpu of the task. This path is majorly used except in few cases like
during fork() and exec().

This assumption of picking an idle CPU is fair in-order to uniformly
consume system resources. But not all tasks deserves to wakeup an idle core
as it can hurt power consumption. For e.g. like small background tasks
waking up an idle core and runs only for very small duration. Reducing
number of active cores allows busier cores to boost frequency and hence
saving power can also result in performance benefits.

There is no mechanism in existing wake up path to detect small
background tasks which can be packed on fewer cores.

> C) Existing control paths:

fair:: .select_task_rq = select_task_rq_fair

fair::select_task_rq_fair()
    // 1) Slow-path: find idle CPU with shallowest idle states
    find_idlest_cpu()

    // 2) Fast-path: find idle CPU
    fair:select_idle_sibling()
        // Wake up on idle core if available
	fair:select_idle_core()
	// Else wakeup on idle CPU if available
	fair:select_idle_cpu()
	fair:select_idle_smt()


There are multiple ways to call fair:select_task_rq_fair();

// 1) try_to_wake_up which should lead to fast-path
   core::try_to_wake_up()
      cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

// 2) wake_up_new_task which should lead to slow-path
   core::wake_up_new_task()
      __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK,0));

// 3) sched_exec which should lead to slow-path
   core::sched_exec()
      dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p),
SD_BALANCE_EXEC, 0);

> D) Desired behaviour:

Latency tolerant tasks with small utilization should be packed
on a busy core rather than waking up a new core/CPU.

Upon detecting small-background tasks, different code-path can be used to
search for a busy core as described below;

sched/fair.c:
static inline bool is_small_bg_task(struct task_struct *p)
{
	if (is_latency_tolerant(p) &&
	    (task_util(p) > (SCHED_CAPACITY_SCALE >> 3)))
		return true;

	return false;
}

sched/fair.c: in select_task_rq_fair()

if (sd_flag & SD_BALANCE_WAKE) {
	if (is_small_bg_task(p)) {
		// New proposed wakeup path to do task packing
		new_cpu = select_non_idle_core(p, prev_cpu);
		if (new_cpu >= 0)
			return new_cpu;
	}
}

where select_non_idle_core() searches for a busy core already running some
tasks and selects an idle CPU in that busy core to pack the waking task.

Complete code can be found on TurboSched patches [1].

> E) Existing knobs (if any): reference to whatever existing tunable

There are no existing knob which can hint the scheduler about the latency
nature of task. Detecting latency nature of the task can help in
classifying task as small background tasks to be packed on fewer number of
cores.

There are user-space hacks to do task packing for background tasks with the
use of cpuset.cpus or sched_setaffinity() to manually affine the process to
fewer  dedicated cores.

> F) Existing knobs (if any): one paragraph description of the limitations

Sysadmin/user has to define cpumask for each process (aka task pinning)
which is static in nature. There are multiple limitations to pin the small
background tasks;

- In presence of just one small background task, there is no power
consumption benefit here. It would be preferable to pin it to busy CPU.

- If a task changes the behavior in its life-cycle then sysadmin will have
to manually pin/unpin such tasks. This is limitation of user to classify
tasks as only "background "one and cannot say if and when it will be
"small" in utilization.

- Sysadmin cannot change the task's affinity mask based on the scheduling
pattern to give most optimal performance and energy consumption.

> G) Proportionality Analysis: check the nature of the target behavior

Task packing has to be noninvasive in nature, meaning only the tasks which
are latency tolerant should be packed on fewer cores. Providing this
behavior needs a run-time tunable knob which can hint the scheduler on
whether the waking task can be packed or not.

Upon detecting the nature of the task, a specific wakeup path can be followed:
1. On latency-tolerant tasks with low utilization, a new proposed
scheduling wakeup path will be followed to do packing
2. On latency-sensitive task, an exiting approach of wakeup can be used.

> H) Range Analysis: identify meaningful ranges

The new knob can be binary input accepting 0/1, where 0 means
latency-sensitive and 1 means latency-tolerant task.

Latency-sensitive tasks (value = 0) can search idle CPU in only the llc
sched_domain while the latency-tolerance (value = 1) tasks can go across
llc sched_domain (just like in slow-path) but here in-order to search for a
busy core.

Mapping analysis:
================
The latency_nice proposal [2] offers a [-20, 19] range which can be
mapped into a binary switch, e.g. using a threshold based function.

However, it is possible to extend the concept of finding busy core by
limiting the time spent on searching based on the latency_nice value from
range [-20, 19] where value of 19 indicates searching in the whole chip for
a busy core, whereas value of 10 could mean search for half of the cores in
the chip.

> I) System-Wide tuning: which knobs are required

The latency_nice provided knobs should be enough to get the desired
effect.

> J) Per-Task tuning: which knobs are required

sched_setscheduler() is sufficient.

> K) Task-Group tuning: which knobs are required

A single attribute classifying task-group as latency_nice or
latency_tolerant is sufficient.


> .:: References
> ==============

[1] [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies
    for longer durations
    https://lkml.org/lkml/2020/1/21/39
[2] [PATCH v5 0/4] Introduce per-task latency_nice for scheduler hints
    Message-ID: 20200228090755.22829-1-parth@...ux.ibm.com
    https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com
[3] TurboSched: the return of small-task packing
    https://lwn.net/Articles/792471/
    https://www.phoronix.com/scan.php?page=news_item&px=Linux-TurboSched-V4