Message-ID: <877cmtxh77.fsf@oracle.com>
Date: Tue, 07 Nov 2023 15:43:40 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>,
linux-kernel@...r.kernel.org, tglx@...utronix.de,
peterz@...radead.org, torvalds@...ux-foundation.org,
paulmck@...nel.org, linux-mm@...ck.org, x86@...nel.org,
akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
willy@...radead.org, mgorman@...e.de, jon.grimm@....com,
bharata@....com, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
jgross@...e.com, andrew.cooper3@...rix.com, mingo@...nel.org,
bristot@...nel.org, mathieu.desnoyers@...icios.com,
geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
anton.ivanov@...bridgegreys.com, mattst88@...il.com,
krypton@...ich-teichert.org, David.Laight@...LAB.COM,
richard@....at, mjguzik@...il.com
Subject: Re: [RFC PATCH 00/86] Make the kernel preemptible

Steven Rostedt <rostedt@...dmis.org> writes:

> On Tue, 7 Nov 2023 13:56:46 -0800
> Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> Hi,
>
> Hi Ankur,
>
> Thanks for doing this!
>
>>
>> We have two models of preemption: voluntary and full (and RT, which is
>> a fuller form of full preemption). In this series -- which is based
>> on Thomas' PoC (see [1]) -- we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> I would say there's also "NONE", which is really just "voluntary" but with
> fewer preemption points ;-) It should still be mentioned, though, otherwise
> people may get confused.
>
>>
>> (Note that this is about preemption when executing in the kernel.
>> Userspace is always preemptible.)
>>
>
>
>> Design
>> ==
>>
>> As Thomas outlines in [1], to unify the preemption models we want to
>> always have preempt_count enabled and allow the scheduler to drive
>> preemption policy based on the model in effect.
>>
>> Policies:
>>
>> - preemption=none: run to completion
>> - preemption=voluntary: run to completion, unless a task of a higher
>>   sched-class awaits
>> - preemption=full: optimized for low latency. Preempt whenever a
>>   higher-priority task awaits.
>>
>> To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
>> scheduler to mark that a reschedule is needed but deferred until the
>> task finishes executing in the kernel -- voluntary preemption, as it
>> were.
>>
>> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
>> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
>>
>>               ret-to-user   ret-to-kernel   preempt_count()
>>   none             Y              N                N
>>   voluntary        Y              Y                Y
>>   full             Y              Y                Y
>
> Wait. The above is for when RESCHED_LAZY is to preempt, right?
>
> Then, shouldn't voluntary be:
>
> voluntary Y N N
>
> For LAZY, but
>
> voluntary Y Y Y
>
> For NEED_RESCHED (without lazy)

Yes. You are, of course, right. I was talking about the TIF_NEED_RESCHED
flags and in the middle switched to talking about how the voluntary model
will get to what it wants.

> That is, the only difference between voluntary and none (as you describe
> above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
> but on none, it still sets NEED_RESCHED_LAZY?

Yeah, exactly. Just to restate without mucking it up:

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user
(see the sketch below the table):

                      ret-to-user   ret-to-kernel   preempt_count()
  NEED_RESCHED_LAZY        Y              N                N
  NEED_RESCHED             Y              Y                Y
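
A minimal sketch of where those checks would live -- the function names
mirror the kernel's generic entry code, but the bodies here are
illustrative simplifications, not the actual patch, and
_TIF_NEED_RESCHED_LAZY is the proposed new mask bit:

/* ret-to-user: either flag forces a reschedule before returning. */
static void exit_to_user_mode_loop(unsigned long ti_work)
{
	if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
		schedule();
}

/* ret-to-kernel (and preempt_enable()): only the eager flag preempts. */
static void irqentry_exit_cond_resched(void)
{
	if (!preempt_count() && need_resched())
		preempt_schedule_irq();
}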

Based on how the various preemption models set these flags, they would
cause preemption at:

              ret-to-user   ret-to-kernel   preempt_count()
  none             Y              N                N
  voluntary        Y              Y                Y
  full             Y              Y                Y
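
To make the policy side concrete, here is a sketch of how a wakeup might
pick which flag to set. Everything named here -- preempt_model,
PREEMPT_MODEL_*, task_is_higher_class() -- is a hypothetical stand-in,
not necessarily what the series uses:

static void resched_curr_lazy(struct rq *rq, struct task_struct *waking)
{
	int tif = TIF_NEED_RESCHED_LAZY; /* default: defer to ret-to-user */

	switch (preempt_model) {
	case PREEMPT_MODEL_NONE:	/* run to completion */
		break;
	case PREEMPT_MODEL_VOLUNTARY:	/* eager only for a higher sched-class */
		if (task_is_higher_class(waking, rq->curr))
			tif = TIF_NEED_RESCHED;
		break;
	case PREEMPT_MODEL_FULL:	/* preempt at the first opportunity */
		tif = TIF_NEED_RESCHED;
		break;
	}

	set_tsk_thread_flag(rq->curr, tif);
}

So preempt=none never sets the eager flag and runs to completion in the
kernel, preempt=full always sets it, and voluntary sits in between.
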
>> The max-load numbers (not posted here) also behave similarly.
>
> It would be interesting to run some "latency-sensitive" benchmarks.
>
> I wonder how cyclictest would work under each model with and without this
> patch?

I didn't post these numbers because I suspect the code isn't quite right,
but voluntary preemption, for instance, does what it promises:

# echo NO_FORCE_PREEMPT > sched/features
# echo NO_PREEMPT_PRIORITY > sched/features # preempt=none
# stress-ng --cyclic 1 --timeout 10
stress-ng: info: [1214172] setting to a 10 second run per stressor
stress-ng: info: [1214172] dispatching hogs: 1 cyclic
stress-ng: info: [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [1214174] cyclic: mean: 9834.56 ns, mode: 3495 ns
stress-ng: info: [1214174] cyclic: min: 2413 ns, max: 3145065 ns, std.dev. 77096.98
stress-ng: info: [1214174] cyclic: latency percentiles:
stress-ng: info: [1214174] cyclic: 25.00%: 3366 ns
stress-ng: info: [1214174] cyclic: 50.00%: 3505 ns
stress-ng: info: [1214174] cyclic: 75.00%: 3776 ns
stress-ng: info: [1214174] cyclic: 90.00%: 4316 ns
stress-ng: info: [1214174] cyclic: 95.40%: 10989 ns
stress-ng: info: [1214174] cyclic: 99.00%: 91181 ns
stress-ng: info: [1214174] cyclic: 99.50%: 290477 ns
stress-ng: info: [1214174] cyclic: 99.90%: 1360837 ns
stress-ng: info: [1214174] cyclic: 99.99%: 3145065 ns
stress-ng: info: [1214172] successful run completed in 10.00s

# echo PREEMPT_PRIORITY > features # preempt=voluntary
# stress-ng --cyclic 1 --timeout 10
stress-ng: info: [916483] setting to a 10 second run per stressor
stress-ng: info: [916483] dispatching hogs: 1 cyclic
stress-ng: info: [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [916484] cyclic: mean: 3682.77 ns, mode: 3185 ns
stress-ng: info: [916484] cyclic: min: 2523 ns, max: 150082 ns, std.dev. 2198.07
stress-ng: info: [916484] cyclic: latency percentiles:
stress-ng: info: [916484] cyclic: 25.00%: 3185 ns
stress-ng: info: [916484] cyclic: 50.00%: 3306 ns
stress-ng: info: [916484] cyclic: 75.00%: 3666 ns
stress-ng: info: [916484] cyclic: 90.00%: 4778 ns
stress-ng: info: [916484] cyclic: 95.40%: 5359 ns
stress-ng: info: [916484] cyclic: 99.00%: 6141 ns
stress-ng: info: [916484] cyclic: 99.50%: 7824 ns
stress-ng: info: [916484] cyclic: 99.90%: 29825 ns
stress-ng: info: [916484] cyclic: 99.99%: 150082 ns
stress-ng: info: [916483] successful run completed in 10.01s

This is with a background kernbench half-load.

Let me see if I can dig out the numbers without this series.

--
ankur