Message-ID: <1609106d05a6a4a5938233e993548510f599d7d9.camel@surriel.com>
Date: Fri, 31 Jul 2020 03:42:10 -0400
From: Rik van Riel <riel@...riel.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Paul Turner <pjt@...gle.com>,
"vincent.guittot" <vincent.guittot@...aro.org>, kernel-team@...com,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dietmar.eggeman" <dietmar.eggeman@....com>
Subject: CFS flat runqueue proposal fixes/update
Hello,
last year at the Linux Plumbers Conference, I presented my work
on turning the hierarchical CFS runqueue into a flat runqueue,
and Paul Turner pointed out some corner cases that could not
work with my design as it was at the time.
Paul pointed out two corner cases, and I have come up with a
third one myself, but I believe they all revolve around the
same issue, admission control, and can all be solved with the
same solution.
This is a fairly complex thing, so this email is long and
in four parts:
- hierarchical runqueue problem statement
- quick design overview of the flat runqueue for CFS
- overview of the problems pointed out by Paul Turner
- description of a possible solution
hierarchical runqueue problem statement
Currently CFS with the cgroups CPU controller uses a
hierarchical design, which means that a task is enqueued
on its cgroup's runqueue, that cgroup is enqueued on its
parent's runqueue, and so on all the way up to the root runqueue.
This also means that every time a task in a cgroup wakes
up or goes to sleep, the common case (1 task on the CPU)
is that the entire hierarchy is enqueued or dequeued,
resulting in significant overhead for workloads with lots
of context switches.
For example, the hierarchy in a system with equal priority
cgroups A and B, with cgroup A subdivided into two unequal
priority cgroups A1 and A2, and a task in each leaf cgroup,
would look like this:
      /\
     /  \
    A    B
   / \    \
  A1  A2   t3
 /  \
t1   t2
The common case on most systems is that CPUs are idle
some of the time, and CPUs usually have 0, 1, or 2
running tasks at a time.
That means when task t1 wakes up when the CPU is idle,
t1 first gets enqueued on A1, which then gets enqueued
on A, which then gets enqueued on the root runqueue.
When t1 goes back to sleep, the whole thing is torn back
down.
In case all three tasks are runnable at the same time,
the scheduler will first choose between A and B in the
root runqueue, and, in case it chose A, between A1 and A2
at the next level down.
CFS flat runqueue design
The code I presented last year operates on the principle of
separating the determination of a task's hierarchical priority,
which is done periodically, from the runqueues themselves, and
of using that hierarchical priority to scale the vruntime of a
task, the same way it is done for nice levels.
The hierarchical priority of a task is the product of the
weight fractions at each level of the runqueue hierarchy along
the path to that task. In the example above, with equal priority
levels for cgroups A, B, A1, and A2, and a total weight of 1000
for the root runqueue, tasks t1, t2, and t3 will have hierarchical
weights of 250, 250, and 500 respectively.
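To make that arithmetic concrete, here is a toy userspace sketch
(the struct layout, names, and weights are made up purely for
illustration; this is not kernel code) that computes those
hierarchical weights by multiplying the weight fractions along
the path from each entity up to the root:

  #include <stdio.h>

  /* Illustrative only: each entity has its own weight plus the
   * total weight of all entities on its parent's runqueue. */
  struct cg {
          const char *name;
          unsigned long weight;       /* this entity's weight */
          unsigned long level_total;  /* total weight on the parent runqueue */
          const struct cg *parent;
  };

  /* Scale the root runqueue weight by the weight fraction at
   * every level on the path from the entity up to the root. */
  static unsigned long hierarchical_weight(const struct cg *e,
                                           unsigned long root_weight)
  {
          unsigned long w = root_weight;

          for (; e; e = e->parent)
                  w = w * e->weight / e->level_total;

          return w;
  }

  int main(void)
  {
          /* Equal priority cgroups A and B; A subdivided into A1 and A2. */
          const struct cg A  = { "A",  1000, 2000, NULL };
          const struct cg B  = { "B",  1000, 2000, NULL };
          const struct cg A1 = { "A1", 1000, 2000, &A };
          const struct cg A2 = { "A2", 1000, 2000, &A };

          /* t1 lives in A1, t2 in A2, t3 in B. */
          printf("t1: %lu\n", hierarchical_weight(&A1, 1000)); /* 250 */
          printf("t2: %lu\n", hierarchical_weight(&A2, 1000)); /* 250 */
          printf("t3: %lu\n", hierarchical_weight(&B,  1000)); /* 500 */
          return 0;
  }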
CFS tracks both wall clock run time and vruntime for each
task, where vruntime is the wall clock run time scaled
according to the task's priority. The vruntime of every
entity on a runqueue advances at the same rate.
The vruntime is calculated as follows:
vruntime = exec_time * FIXED_POINT_ONE / priority
That is, if the priority of the task is equal to the
fixed point constant used, the vruntime corresponds
to the executed wall clock time, while a lower priority
task has its vruntime advance at a faster rate, and
a higher priority task has its vruntime advance slower.
With every entity on a runqueue having its vruntime
advance at the same rate, that translates into higher
priority tasks receiving more CPU time, and lower
priority tasks receiving less CPU time.
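As a toy illustration of that scaling (the constant and the
helper below are made up; the kernel's actual fixed point math
looks different), the same 3 ms slice turns into very different
vruntime advances depending on hierarchical priority:

  #include <stdio.h>

  #define FIXED_POINT_ONE 1000UL  /* illustrative fixed point "1.0" */

  /* vruntime advances inversely proportional to hierarchical
   * priority: at priority == FIXED_POINT_ONE it tracks the wall
   * clock execution time exactly. */
  static unsigned long long scale_vruntime(unsigned long long exec_ns,
                                           unsigned long prio)
  {
          return exec_ns * FIXED_POINT_ONE / prio;
  }

  int main(void)
  {
          unsigned long long slice = 3000000ULL;  /* 3 ms of wall clock time */

          printf("prio 1000: vruntime += %llu ns\n", scale_vruntime(slice, 1000));
          printf("prio  500: vruntime += %llu ns\n", scale_vruntime(slice, 500));
          printf("prio  250: vruntime += %llu ns\n", scale_vruntime(slice, 250));
          return 0;
  }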
Problems identified by Paul Turner
Let's revisit the hierarchy from above, and assign priorities
to the cgroups, with the fixed point one being 1000. Let's
say cgroups A, A1, and B have priority 1000, while cgroup
A2 has priority 1.
      /\
     /  \
    A    B
   / \    \
  A1  A2   t3
 /  \
t1   t2
One consequence of this is that when t1, t2, and t3 each
get a time slice, the vruntime of tasks t1 and t3 advances
at roughly the same speed as the clock time, while the
vruntime of task t2 advances 1000x faster.
This is fine if all three tasks continue to be runnable,
since t1, t2 and t3 each get their fair share of CPU time.
However, if t1 goes to sleep, t2 is the only thing running
inside cgroup A, which has the same priority as cgroup B,
and tasks t2 and t3 should be getting the same amount of
CPU time.
They eventually will, but not before task t3 has used up
enough CPU time to catch up with the enormous vruntime
advance that t2 just suffered.
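To put a rough, made-up number on that gap, plugging the example
priorities into the formula above: a single 3 ms slice at
hierarchical priority 1 advances t2's vruntime by about 3
seconds, which t3 (whose vruntime tracks roughly wall clock
time) then has to burn off before t2 runs again:

  #include <stdio.h>

  #define FIXED_POINT_ONE 1000ULL

  int main(void)
  {
          /* Made-up numbers: t2 runs one 3 ms slice while its
           * hierarchical priority is 1 (cgroup A2 at weight 1),
           * so its vruntime advances 1000x wall clock time. */
          unsigned long long slice_ns = 3000000ULL;
          unsigned long long t2_jump  = slice_ns * FIXED_POINT_ONE / 1;

          /* Once t1 sleeps, t3 (vruntime ~= wall clock) has to run
           * roughly this long before t2 gets scheduled again. */
          printf("t2 vruntime jump: %llu ns\n", t2_jump);            /* ~3 s */
          printf("t3 catch-up time: ~%llu ns\n", t2_jump - slice_ns);
          return 0;
  }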
That needs to be fixed, to get near-immediate convergence,
and not convergence after some unknown (potentially long)
period of time.
There are similar issues with thundering herds and cgroup
cpu.max throttling, where a large number of tasks can be
woken up simultaneously.
     /\
    /  \
   A    B
   |   /|\
  t1  t2 t3 t4 t5 ...
                   t42
In the current, hierarchical runqueue setup, CFS will
time slice between cgroups A and B at the top level,
which means task t1 will never go long without being
scheduled. Only the tasks in cgroup B suffer the
latency effects of cgroup B being overloaded with way
too many runnable tasks.
A similar issue can happen when a cgroup is throttled
with cpu.max, and wakes up with a large number of
runnable tasks inside.
It is desirable that fairness, both from a latency and
a throughput point of view, converge quickly rather
than slowly.
Possible solution
A possible solution to the problems above consists of
three parts:
- Instead of accounting all used wall clock CPU time into
vruntime immediately, at the task's current hierarchical
priority, the vruntime can be accounted piecemeal, for
example in amounts corresponding to "one timeslice at full
priority" (a rough sketch of this accounting follows after
this list).
- If there is only one runnable task on the runqueue,
all the runtime can be accounted into vruntime in one go.
- Tasks that cannot account all of their used CPU time
into vruntime at once can be removed from the root runqueue,
and placed into the cgroup runqueue. A heap of cgroup
runqueues with "overloaded" tasks can be attached to the
main runqueue, where the left-most task from that heap of
heaps gets some vruntime accounted every time we go into
pick_next_task.
- The difference between the vruntime of each task in that
heap and the vruntime of the root runqueue can help determine
how much vruntime can be accounted to that task at once.
- If the task, or its runqueue, is no longer the left-most
in the heap after getting vruntime accounted, that runqueue
and the queue of runqueues can be resorted.
- Once a task has accounted all of its outstanding delta
exec runtime into vruntime, it can be moved back to the
main runqueue.
- This should solve the unequal task weight scenario Paul
Turner pointed out last year, because after task t1 goes
to sleep and only t2 and t3 remain on the CPU, t2 will
get its delta exec runtime converted into vruntime at
its new priority (equal to t3).
- By only accounting delta exec runtime to vruntime for
the left-most task in the "overloaded" heap at one time,
we guarantee only one task at a time will be added back
into the root runqueue.
- Every time a task is added to the root runqueue, that
slows down the rate at which vruntime advances, which
in turn reduces the rate at which tasks get added back
into the runqueue, and makes it more likely that a currently
running task with low hierarchical priority gets booted
off into the "overloaded" heap.
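To make the above a little more concrete, below is a standalone
toy sketch of the piecemeal accounting and the "left-most
overloaded task" selection. Everything in it, the struct fields,
the MAX_ACCT_VRUNTIME cap, and the linear scan standing in for a
heap or rbtree, is made up for illustration; it is not proposed
kernel code:

  #include <stdio.h>

  #define FIXED_POINT_ONE      1000ULL
  #define MAX_ACCT_VRUNTIME 3000000ULL  /* "one timeslice at full priority" */
  #define NR_TASKS                   3

  /* Toy stand-in for a task in the "overloaded" heap: wall clock
   * runtime that still has to be converted into vruntime at the
   * task's hierarchical priority. */
  struct otask {
          const char *name;
          unsigned long long vruntime;
          unsigned long long delta_exec;  /* runtime not yet accounted */
          unsigned long long hier_prio;
  };

  /* Account at most "one timeslice at full priority" worth of
   * vruntime in a single step. */
  static void account_piecemeal(struct otask *t)
  {
          /* Cap the wall clock chunk so one low priority task cannot
           * make an enormous vruntime jump all at once. */
          unsigned long long max_exec =
                  MAX_ACCT_VRUNTIME * t->hier_prio / FIXED_POINT_ONE;
          unsigned long long chunk = t->delta_exec;

          if (chunk > max_exec)
                  chunk = max_exec;

          t->vruntime += chunk * FIXED_POINT_ONE / t->hier_prio;
          t->delta_exec -= chunk;
  }

  /* Stand-in for the heap of overloaded runqueues: return the task
   * with the smallest vruntime that still has runtime to account.
   * The kernel would use an rbtree or heap here, not a scan. */
  static struct otask *pick_leftmost(struct otask *tasks, int nr)
  {
          struct otask *best = NULL;
          int i;

          for (i = 0; i < nr; i++) {
                  if (!tasks[i].delta_exec)
                          continue;
                  if (!best || tasks[i].vruntime < best->vruntime)
                          best = &tasks[i];
          }
          return best;
  }

  int main(void)
  {
          struct otask tasks[NR_TASKS] = {
                  { "t2", 0,    9000ULL,    1 },  /* tiny hierarchical prio */
                  { "t4", 0, 1500000ULL,  250 },
                  { "t5", 0, 3000000ULL,  500 },
          };
          struct otask *t;

          /* Something like pick_next_task() would do one of these
           * steps per invocation; a task whose delta_exec reaches
           * zero gets moved back to the root runqueue. */
          while ((t = pick_leftmost(tasks, NR_TASKS))) {
                  account_piecemeal(t);
                  printf("%-3s vruntime=%8llu delta_exec=%8llu%s\n",
                         t->name, t->vruntime, t->delta_exec,
                         t->delta_exec ? "" : "  -> back to root runqueue");
          }
          return 0;
  }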
To tackle the thundering herd at task wakeup time, another
strategy may be needed. One thing we may be able to do there
is place tasks into the "overloaded" heap immediately on
wakeup, if the hierarchical priority of the task is so low
that, were the task to run even a minimal timeslice, it would
be unable to account that time into its vruntime in one go,
AND the CPU already has a large number of tasks on it.
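For instance, that wakeup time check could look something like
the toy helper below (the threshold constants, the nr_running
cutoff, and the name are all invented for illustration):

  #include <stdbool.h>
  #include <stdio.h>

  #define FIXED_POINT_ONE      1000ULL
  #define MIN_SLICE_NS       750000ULL  /* made-up minimal timeslice */
  #define MAX_ACCT_VRUNTIME 3000000ULL  /* max vruntime accounted in one go */
  #define BUSY_NR_RUNNING           8   /* "a large number of tasks" cutoff */

  /* Decide at wakeup whether a task should go straight into the
   * "overloaded" heap instead of the root runqueue. */
  static bool wakeup_into_overloaded(unsigned long long hier_prio,
                                     unsigned int nr_running)
  {
          /* vruntime a minimal timeslice would generate at this priority */
          unsigned long long vslice = MIN_SLICE_NS * FIXED_POINT_ONE / hier_prio;

          return vslice > MAX_ACCT_VRUNTIME && nr_running >= BUSY_NR_RUNNING;
  }

  int main(void)
  {
          printf("prio 1000,  2 running: %d\n", wakeup_into_overloaded(1000, 2));
          printf("prio    1, 50 running: %d\n", wakeup_into_overloaded(1, 50));
          return 0;
  }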
Because the common case on most systems is having just
0, 1, or 2 runnable tasks on a CPU, this fancy scheme
should rarely end up being used, and even when it is, the
overhead should be reasonable because most of the
overloaded tasks will just sit there until pick_next_task
gets around to them.
Does this seem like something worth trying out?
Did I overlook any other corner cases that would make this
approach unworkable?
Did I forget to explain anything that is needed to help
understand the problem and the proposed solution better?
--
All Rights Reversed.