linux-kernel - Re: [PATCH] sched/deadline: Add sched

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140120131616.GB8907@austad.us>
Date:	Mon, 20 Jan 2014 14:16:17 +0100
From:	Henrik Austad <henrik@...tad.us>
To:	Juri Lelli <juri.lelli@...il.com>
Cc:	peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com,
	rostedt@...dmis.org, oleg@...hat.com, fweisbec@...il.com,
	darren@...art.com, johan.eker@...csson.com, p.faure@...tech.ch,
	linux-kernel@...r.kernel.org, claudio@...dence.eu.com,
	michael@...rulasolutions.com, fchecconi@...il.com,
	tommaso.cucinotta@...up.it, nicola.manica@...i.unitn.it,
	luca.abeni@...tn.it, dhaval.giani@...il.com, hgu1972@...il.com,
	paulmck@...ux.vnet.ibm.com, raistlin@...ux.it,
	insop.song@...il.com, liming.wang@...driver.com, jkacur@...hat.com,
	harald.gustafsson@...csson.com, vincent.guittot@...aro.org,
	bruce.ashfield@...driver.com, rob@...dley.net
Subject: Re: [PATCH] sched/deadline: Add sched_dl documentation

On Mon, Jan 20, 2014 at 01:15:51PM +0100, Juri Lelli wrote:
> On 01/20/2014 12:24 PM, Henrik Austad wrote:
> > On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> >> From: Dario Faggioli <raistlin@...ux.it>
> >>
> >> Add in Documentation/scheduler/ some hints about the design
> >> choices, the usage and the future possible developments of the
> >> sched_dl scheduling class and of the SCHED_DEADLINE policy.
> >>
> >> Cc: bruce.ashfield@...driver.com
> >> Cc: claudio@...dence.eu.com
> >> Cc: darren@...art.com
> >> Cc: dhaval.giani@...il.com
> >> Cc: fchecconi@...il.com
> >> Cc: fweisbec@...il.com
> >> Cc: harald.gustafsson@...csson.com
> >> Cc: hgu1972@...il.com
> >> Cc: insop.song@...il.com
> >> Cc: jkacur@...hat.com
> >> Cc: johan.eker@...csson.com
> >> Cc: liming.wang@...driver.com
> >> Cc: luca.abeni@...tn.it
> >> Cc: michael@...rulasolutions.com
> >> Cc: mingo@...hat.com
> >> Cc: nicola.manica@...i.unitn.it
> >> Cc: oleg@...hat.com
> >> Cc: paulmck@...ux.vnet.ibm.com
> >> Cc: p.faure@...tech.ch
> >> Cc: rob@...dley.net
> >> Cc: rostedt@...dmis.org
> >> Cc: tglx@...utronix.de
> >> Cc: tommaso.cucinotta@...up.it
> >> Cc: vincent.guittot@...aro.org
> >> Signed-off-by: Dario Faggioli <raistlin@...ux.it>
> >> Signed-off-by: Juri Lelli <juri.lelli@...il.com>
> >> Signed-off-by: Peter Zijlstra <peterz@...radead.org>
> >> ---
> >>  Documentation/scheduler/00-INDEX           |    2 +
> >>  Documentation/scheduler/sched-deadline.txt |  189 ++++++++++++++++++++++++++++
> >>  kernel/sched/deadline.c                    |    3 +-
> >>  3 files changed, 193 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/scheduler/sched-deadline.txt
> >>
> >> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> >> index d2651c4..46702e4 100644
> >> --- a/Documentation/scheduler/00-INDEX
> >> +++ b/Documentation/scheduler/00-INDEX
> >> @@ -10,5 +10,7 @@ sched-nice-design.txt
> >>  	- How and why the scheduler's nice levels are implemented.
> >>  sched-rt-group.txt
> >>  	- real-time group scheduling.
> >> +sched-deadline.txt
> >> +	- deadline scheduling.
> >>  sched-stats.txt
> >>  	- information on schedstats (Linux Scheduler Statistics).
> >> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> >> new file mode 100644
> >> index 0000000..8980de1
> >> --- /dev/null
> >> +++ b/Documentation/scheduler/sched-deadline.txt
> >> @@ -0,0 +1,189 @@
> >> +			  Deadline Task Scheduling
> >> +			  ------------------------
> >> +
> >> +CONTENTS
> >> +========
> >> +
> >> + 0. WARNING
> >> + 1. Overview
> >> + 2. Task scheduling
> >> + 2. The Interface
> >> + 3. Bandwidth management
> >> +   3.1 System-wide settings
> >> +   3.2 Task interface
> >> +   3.4 Default behavior
> >> + 4. Tasks CPU affinity
> >> +   4.1 SCHED_DEADLINE and cpusets HOWTO
> >> + 5. Future plans
> >> +
> >> +
> >> +0. WARNING
> >> +==========
> >> +
> >> + Fiddling with these settings can result in an unpredictable or even unstable
> >> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> >> + know what they're doing.
> >> +
> >> +
> >> +1. Overview
> >> +===========
> >> +
> >> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> >> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> >> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> >> + that makes it possible to isolate the behavior of tasks between each other.
> > 
> > 
> > Why not something along the lines of giving a task a guaranteed slice of 
> > the CPU as well as making sure that a task takes no more than a given 
> > slice? I.e. making the point of a lower as well as an upper limit of CPU 
> > usage.
> > 
> 
> I'd keep the term "isolate" in, as is one of the strong points on having all
> this merged in. But, we could add something along the lines you suggested:
> 
> "that makes it possible to isolate the behavior of tasks between each other.
> IOW, isolation means that we can reserve a task a guaranteed percentage of the
> CPU and, at the same time, we ensure that the task takes no more than the
> percentage reserved."

+1 from me!

> >> +2. Task scheduling
> >> +==================
> >> +
> >> + The typical -deadline task is composed of a computation phase (instance)
> >> + which is activated on a periodic or sporadic fashion. The expected (maximum)
> >> + duration of such computation is called the task's runtime; the time interval
> >> + by which each instance needs to be completed is called the task's relative
> >> + deadline. The task's absolute deadline is dynamically calculated as the
> >> + time instant a task (or, more properly) activates plus the relative
> >> + deadline.
> > 
> > activates - released?
> > 
> 
> I'd keep (modifying a bit):
> 
> "time instant a task activates plus the relative deadline."
> 
> This is probably the nearest thing to what is implemented that we can say
> (without entering into the theory too much), a task that "activates" can mean
> that it is first released, enqueued, woken-up, etc.

So, if we look at release (yes, I'm avoiding activates for a little while) 
as the time at the *beginning* of a new period, then, and only then should 
the *absolute* deadline be computed.

If you den move on to use 'activate' as a term for when a task becomes 
eligble to run, then 'release' becomes a subset of 'activate', and you 
should only compute the absolute deadline at that time. Did that make 
sense?

> > Since real-time papers from different rt-campus around the academia insist 
> > on using *slightly* different terminology, perhaps add a short dictionary 
> > for some of the more common terms?
> > 
> > D: relative deadline, typically N ms after release
> > d: absolute deadline, the physical time when a given instance of a job 
> >    needs to be completed
> > R: relative release time, for periodic tasks, this is typically 'every N 
> >    ms'
> > r: absolute release time
> > C: Worst-case execution time
> > 
> >    ...you get the idea.
> > 
> > Perhaps too academic?
> > 
> 
> Mmm.. we don't go too deep in theory (we just refer to papers below), could
> adding a dictionary only complicate things? I mean, if you add a term you have
> to explain its meaning related to the task-model you are using. And this means
> you have to also define the task model and so on. Who wants more details
> already finds them in the papers below.

Honestly, how many readers of sched_deadline.txt are going to read those 
papers to have the terms defined?

Either way, having a *short* list of terms we use _in_ sched_deadline would 
help avoiding confusion. For instance, what 'activate' actually means.

> >> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> >> + the one to be executed first, while the CBS[2,3] ensures that each task runs
> >> + for at most its runtime every period, avoiding any interference between
> >> + different tasks (bandwidth isolation).
> >> + Thanks to this feature, also tasks that do not strictly comply with the
> >> + computational model described above can effectively use the new policy.
> >> + IOW, there are no limitations on what kind of task can exploit this new
> >> + scheduling discipline, even if it must be said that it is particularly
> >> + suited for periodic or sporadic tasks that need guarantees on their
> >> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> > 
> > I assume that ties are broken arbitrarily and that a running task is not 
> > preempted for another task with equal deadline. Correct?
> > 
> 
> Yes.
> 
> > This would be a nice point to include in this doc methinks.

how about adding something like that?

"""
Whenever 2 (or more) tasks happen to have the exact same absolute deadine, 
ties are broken arbitrarily. One notable exception occur when one of the 
tasks are already running, in which case no preemption occurs.
"""

> >> + References:
> >> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> >> +      ming in a hard-real-time environment. Journal of the Association for
> >> +      Computing Machinery, 20(1), 1973.
> >> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> >> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> >> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> >> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> >> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> >> +
> >> +3. Bandwidth management
> >> +=======================
> >> +
> >> + In order for the -deadline scheduling to be effective and useful, it is
> >> + important to have some method to keep the allocation of the available CPU
> >> + bandwidth to the tasks under control.
> >> + This is usually called "admission control" and if it is not performed at all,
> >> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> >> +
> >> + Since when RT-throttling has been introduced each task group has a bandwidth
> >> + associated, calculated as a certain amount of runtime over a period.
> >> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> >> + controls have been added to both procfs (for system wide settings) and cgroupfs
> >> + (for per-group settings).
> >> + Therefore, the same interface is being used for controlling the bandwidth
> >> + distrubution to -deadline tasks.
> >> +
> >> + However, more discussion is needed in order to figure out how we want to manage
> >> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> >> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> >> + ensure that a certain utilization cap is not overcome per each root_domain.
> >> +
> >> + Another main difference between deadline bandwidth management and RT-throttling
> >> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> >> + and thus we don't need an higher level throttling mechanism to enforce the
> >> + desired bandwidth.
> >> +
> >> +3.1 System wide settings
> >> +------------------------
> >> +
> >> + The system wide settings are configured under the /proc virtual file system.
> >> +
> >> + For now the -rt knobs are used for dl admission control and the -deadline
> >> + runtime is accounted against the -rt runtime. We realise that this isn't
> >> + entirely desirable; however, it is better to have a small interface for now,
> >> + and be able to change it easily later. The ideal situation (see 5.) is to run
> >> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> >> + subset of dl_bw.
> >> +
> >> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> >> + can be created while the sum of their bandwidths stays below:
> >> +
> >> +   M * (sched_rt_runtime_us / sched_rt_period_us)
> >> +
> >> + It is also possible to disable this bandwidth management logic, and
> >> + be thus free of oversubscribing the system up to any arbitrary level.
> >> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> >> +
> >> +
> >> +3.2 Task interface
> >> +------------------
> >> +
> >> + Specifying a periodic/sporadic task that executes for a given amount of
> >> + runtime at each instance, and that is scheduled according to the urgency of
> >> + its own timing constraints needs, in general, a way of declaring:
> >> +  - a (maximum/typical) instance execution time,
> >> +  - a minimum interval between consecutive instances,
> >> +  - a time constraint by which each instance must be completed.
> >> +
> >> + Therefore:
> >> +  * a new struct sched_attr, containing all the necessary fields is
> >> +    provided;
> >> +  * the new scheduling related syscalls that manipulate it, i.e.,
> >> +    sched_setattr() and sched_getattr() are implemented.
> >> +
> >> +
> >> +3.3 Default behavior
> >> +---------------------
> >> +
> >> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> >> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
> >     ^^^^
> >     This seems to be 9.5% to me ;)
> > 
> 
> Argh! s/95000/950000/
> 
> >> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> >> + root_domain, for each root_domain.
> >> +
> >> + A -deadline task cannot fork.
> >> +
> >> +4. Tasks CPU affinity
> >> +=====================
> >> +
> >> + -deadline tasks cannot have an affinity mask smaller that the entire
> >> + root_domain they are created on. However, affinities can be specified
> >> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> > 
> > Does this mean that sched_deadline is a somewhat global implementation? Or 
> > rather, at what point in time will sched_deadline take all cpus in a set 
> > into consideration and when will it only look at the current CPU? Where is 
> > the line drawn between global and fully partitioned?
> > 
> > Also, how do you account the budget when a resource holder is boosted in 
> > order to release a resource? (IIRC, you use BWI, right?)
> > 
> 
> Peter already replied about this.
> 
> Thanks,
> 
> - Juri

Feel free to add a reviewed-by from me on this one :)

-- 
Henrik Austad

Download attachment "signature.asc" of type "application/pgp-signature" (199 bytes)