linux-kernel - Re: [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230608115258.GJ998233@hirez.programming.kicks-ass.net>
Date:   Thu, 8 Jun 2023 13:52:58 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     mingo@...nel.org, linux-kernel@...r.kernel.org,
        juri.lelli@...hat.com, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, corbet@....net, qyousef@...alina.io,
        chris.hyser@...cle.com, patrick.bellasi@...bug.net, pjt@...gle.com,
        pavel@....cz, qperret@...gle.com, tim.c.chen@...ux.intel.com,
        joshdon@...gle.com, timj@....org, kprateek.nayak@....com,
        yu.c.chen@...el.com, youssefesmat@...omium.org,
        joel@...lfernandes.org, efault@....de, tglx@...utronix.de
Subject: Re: [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to
 set request/slice

On Thu, Jun 01, 2023 at 03:55:18PM +0200, Vincent Guittot wrote:
> On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > As an alternative to the latency-nice interface; allow applications to
> > directly set the request/slice using sched_attr::sched_runtime.
> >
> > The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> > which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
> 
> There were some discussions about the latency interface and setting a
> raw time value. The problems with using a raw time value are:

So yeah, I'm well aware of that. And I'm not saying this is a better
interface -- just an alternative.

> - what  does this raw time value mean ? and how it applies to the
> scheduling latency of the task. Typically what does setting
> sched_runtime to 1ms means ? Regarding the latency, users would expect
> to be scheduled in less than 1ms but this is not what will (always)
> happen with a sched_slice set to 1ms whereas we ensure that the task
> will run for sched_runtime in the sched_period (and before
> sched_deadline) when using it with deadline scheduler. so this will be
> confusing

Confusing only if you don't know how to look at it; users are confused
in general and that's unfixable, nature will always invent a better
moron. The best we can do is provide enough clues for someone that does
know what he's doing.

So let me start by explaining how such an interface could be used and
how to look at it.

(and because we all love steady state things; I too shall use it)

Consider 4 equal-weight always running tasks (A,B,C,D) with a default
slice of 1ms.  The perfect schedule for this is a straight up FIFO
rotation of the 4 tasks, 1ms each for a total period of 4ms.

  ABCDABCD...
  +---+---+---+---

By keeping the tasks in the same order, we ensure the max latency is the
min latency -- consistency is king. If for one period you were to
say flip the first and last tasks in the order, your max latency takes a
hit, the task that was first will now have to wait 7ms instead of it's
usual 3ms.

  ABCDDBCA...
  +---+---+---+---

So far so obvious and boring..

Now, is we were to change the slice of task D to 2ms, what happens is
that it can't run the first time, because the slice rotations are 1ms,
and it needs 2ms, so it needs to save up and bank the first slot, so you
get a schedule like:

  ABCABCDDABCABCDD...
  +---+---+---+---+---

And here you can see that the total period becomes 8 (N*r_max). The
period for the 1ms tasks is still 4ms -- on average, but the period for
the 2ms task is 8ms.


A more complex example would be 3 tasks: A(w=1,r=1), B(w=1,r=1),
C(w=2,r=1) [to keep the 4ms period]:

  CCABCCAB...
  +---+---+---+---

If we change the slice of B to 2 then it becomes:

  CCACCABBCCACCABB...
  +---+---+---+---+---

So the total period is W*r_max (8ms), each task will average to a period
of W*r_i and each task will get the fair share of w_i/W time over the
total period (W*r_max per previous).

> - more than a runtime, we want to set a scheduling latency hint which
> would be more aligned with a deadline

We all wants ponies ;-) But seriously if you have a real deadline, use
SCHED_DEADLINE.

> - Then the user will complain that he set 1ms but its task is
> scheduled after several (or even dozens) ms in some cases. Also, you
> will probably end up with everybody setting 0.1ms and expecting 0.1ms
> latency. The latency nice like the nice give an opaque weight against
> others without any determinism that we can't respect

Now, notably I used sched_attr::sched_runtime, not _deadline nor _period.
Runtime is how long you expect each job-execution to take (WCET and all
that) in a periodic or sporadic task model.

Given this is a best effort overcommit scheduling class, we *CANNOT*
guarantee actual latency. The best we can offer is consistency (and this
is where EEVDF is *much* better than CFS).

We cannot, and must not pretend to provide a real deadline; hence we
should really not use that term in the user interface for this.


>From the above examples we can see that if you ask for 1ms slices, you
get 1ms slices spaced (on average) closer together than if you were to
ask for 2ms slices -- even though they end up with the same share of
CPU-time.

Per the previous argument, the 2ms slice task has to forgo one slot in
first period to bank and save up for a 2ms slot in a super period.

How, if you're not a CPU hogging bully and don't use much CPU time at
all (your music player etc..) then setting the slice length to what it
actually takes to decode the next sample buffer, you can likely get a
smaller average period.

Conversely, if you ask for a slice significantly smaller than your job
execution time, you'll see it'll get split up in smaller chunks and
suffer preemption.

> - How do you set that you don't want to preempt others ? But still
> want to keep your allocated running time.

SCHED_BATCH is what we have for that. That actually works.