Message-ID: <ffcff197-122f-c5bd-91f4-0ff8600be04d@oracle.com>
Date: Thu, 9 Jul 2020 19:08:05 -0400
From: chris hyser <chris.hyser@...cle.com>
To: Parth Shah <parth@...ux.ibm.com>,
Patrick Bellasi <patrick.bellasi@...bug.net>,
LKML <linux-kernel@...r.kernel.org>
Cc: Ingo Molnar <mingo@...hat.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Jonathan Corbet <corbet@....net>,
Dhaval Giani <dhaval.giani@...cle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Josef Bacik <jbacik@...com>
Subject: [SchedulerWakeupLatency] Skipping Idle Cores and CPU Search
> A) Name:
Skipping Idle Cores and CPU Search
> B) Target behavior:
Searching for idle CPUs in the CFS scheduler when placing awakened tasks increases system throughput at the expense of
additional wake-up latency. For the majority of processes this is a reasonable trade-off. Some communication tasks,
however, are both sufficiently latency-sensitive and sufficiently short-lived (on large systems the search time can
exceed the typical run-before-sleep time of the comm task itself) to warrant skipping the search. While a real-time
priority in a correctly configured environment might remove the latency, real-time has its own issues, in particular
with cgroup interoperability/configuration.
> C) Existing control paths: reference to code paths to justify B)
--------------------------------------------------------------
select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)

	nr = INT_MAX;
	avg_idle = this_rq()->avg_idle / 512;
	avg_cost = this_sd->avg_scan_cost + 1;

	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
		return -1;

	if (sched_feat(SIS_PROP)) {
		u64 span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4*avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}

	// conditionally use a per-task tunable to adjust 'nr' or return 'target'.

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}

	return cpu;
-------------------------------------------------------------
select_idle_sibling(struct task_struct *p, int prev, int target)

	// if NOT per-task tunable set, i.e. the default for tasks:

	i = select_idle_core(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_smt(p, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;
-------------------------------------------------------------
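For concreteness, a rough sketch of how the per-task tunable could hook into select_idle_cpu() at the point marked by
the comment above, after 'nr' has been computed. The field name p->sched_search_limit and the SEARCH_ALL sentinel are
illustrative placeholders, not code from an actual patch:
-------------------------------------------------------------
	// Hypothetical hook: 'sched_search_limit' and SEARCH_ALL are
	// placeholder names. 0 skips the scan entirely (wake on 'target');
	// any other value bounds the for_each_cpu_wrap() iterations.
	if (p->sched_search_limit != SEARCH_ALL) {
		if (p->sched_search_limit == 0)
			return target;
		nr = p->sched_search_limit;
	}
-------------------------------------------------------------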
> D) Desired behavior:
Reduce the maximum wake-up latency of designated CFS tasks by skipping some or all of the idle-core and idle-CPU
searches, via a per-task maximum idle-CPU search value (a cap on loop iterations).
A maximum of 'ALL' would be the default and implies the current code path, which may or may not search all CPUs. A
maximum of 0 would yield the least latency (to be shown with experimental results if/when the patchset goes up). One
consideration is that the maximum length of the search is a function of the size of the LLC scheduling domain, which is
platform dependent. Whether an intermediate numerical limit can be used to "normalize" this latency across differing
scheduling-domain sizes is under investigation (one speculative sketch follows). Clearly, differing hardware will have
many other significant differences, but for differently sized and dynamically sized VMs running on fleets of
common HW this may be interesting.
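As one purely speculative illustration of that normalization idea, the per-task value could be expressed as a fraction
of the LLC span rather than an absolute count; scale_search_limit() and the percentage encoding below are assumptions,
not part of the proposal:
-------------------------------------------------------------
	// Illustrative only: map an abstract, platform-independent
	// 'budget' (a percentage of the LLC span) to a concrete
	// iteration cap for this platform's scheduling-domain size.
	static inline int scale_search_limit(struct sched_domain *sd,
					     int budget_pct)
	{
		int nr = sd->span_weight * budget_pct / 100;

		return nr > 0 ? nr : 1;	// always allow at least one CPU
	}
-------------------------------------------------------------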
> E/F) Existing knobs (and limitations):
There are existing sched_feat flags, SIS_AVG_CPU and SIS_PROP, that attempt to short-circuit the idle-CPU search path in
select_idle_cpu() based on estimates of the current cost of searching. Neither provides a means of task-based
selectivity, e.g. shortening the search for an unimportant task while lengthening it for a critical one. This also
highlights the difference between lowering average or tail latencies for all tasks versus, as desired here, for
specific targeted ones. Not surprisingly, latency reduction here, as in general, comes at the expense of system
throughput, in this case load imbalance within the LLC scheduling domain.
The proposal here allows maximum reduction of the search time for specific tasks. Keeping the number of such tasks
small has minimal (though not zero) impact on system throughput, as the vast majority of tasks remain subject to the
existing searches.
> G) Proportionality Analysis:
The immediate use case suggests a binary choice: one either cares about a task's latency or not, and so searches either
'ALL' or 0 CPUs (presumably one generally wants maximum latency reduction). The nature of the problem, however (see the
pseudocode above), allows precise specification of the maximum number of CPUs to search, placing an upper limit on
search latency.
> H) Range Analysis:
The knob is a non-negative integer representing the maximum number of CPUs to search. The default would be 'ALL', which
could be translated as INT_MAX; 0 means skip the search entirely. Other values place an upper limit on the search, in
this case on the iterations of a for loop.
> I) System-Wide tuning:
No system-wide tuning required.
> J) Per-Task tuning:
The proposal is a single per-task non-negative integer representing the maximum number of CPUs to search. The
sched_setattr() syscall would allow both 'nice' and this value to be set together, which is a reasonably expected use
case for latency-sensitive tasks (see the sketch below).
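A sketch of the userspace side, assuming a hypothetically extended struct sched_attr: the sched_search_limit field does
not exist upstream, the layout shown is the original sched_attr plus the new field (glossing over later upstream
additions such as the utilization-clamp fields), and since sched_setattr(2) has no glibc wrapper the raw syscall is
used:
-------------------------------------------------------------
	#include <sched.h>
	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	// The kernel's struct sched_attr with one HYPOTHETICAL trailing
	// field appended; only a kernel patched to know the extra field
	// would accept the larger size (an unpatched one would not).
	struct sched_attr_ext {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
		uint32_t sched_search_limit;	// hypothetical knob
	};

	// Set both 'nice' and the idle-search limit in one call.
	static int set_wakeup_latency(pid_t pid, int nice, uint32_t limit)
	{
		struct sched_attr_ext attr = {
			.size         = sizeof(attr),
			.sched_policy = SCHED_OTHER,
			.sched_nice   = nice,
			.sched_search_limit = limit,	// 0 => skip the search
		};

		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}
-------------------------------------------------------------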
> K) Task-Group tuning:
As the intention here is to limit this behavior to specific, identified tasks, and overall performance can be
negatively affected if too many or the wrong tasks are selected, this use case does not appear to need task groupings,
hierarchical relationships, or child inheritance.
-chrish