Message-ID: <877cmtxh77.fsf@oracle.com>
Date:   Tue, 07 Nov 2023 15:43:40 -0800
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Ankur Arora <ankur.a.arora@...cle.com>,
        linux-kernel@...r.kernel.org, tglx@...utronix.de,
        peterz@...radead.org, torvalds@...ux-foundation.org,
        paulmck@...nel.org, linux-mm@...ck.org, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, jon.grimm@....com,
        bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com, mingo@...nel.org,
        bristot@...nel.org, mathieu.desnoyers@...icios.com,
        geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
        anton.ivanov@...bridgegreys.com, mattst88@...il.com,
        krypton@...ich-teichert.org, David.Laight@...LAB.COM,
        richard@....at, mjguzik@...il.com
Subject: Re: [RFC PATCH 00/86] Make the kernel preemptible


Steven Rostedt <rostedt@...dmis.org> writes:

> On Tue,  7 Nov 2023 13:56:46 -0800
> Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> Hi,
>
> Hi Ankur,
>
> Thanks for doing this!
>
>>
>> We have two models of preemption: voluntary and full (and RT, which is
>> a fuller form of full preemption). In this series -- which is based
>> on Thomas' PoC (see [1]) -- we try to unify the two by letting the
>> scheduler enforce policy for the voluntary preemption models as well.
>
> I would say there's "NONE", which is really just "voluntary" but with
> fewer preemption points ;-) It should still be mentioned, though, otherwise
> people may get confused.
>
>>
>> (Note that this is about preemption when executing in the kernel.
>> Userspace is always preemptible.)
>>
>
>
>> Design
>> ==
>>
>> As Thomas outlines in [1], to unify the preemption models we want to
>> always have the preempt_count enabled and allow the scheduler to drive
>> preemption policy based on the model in effect.
>>
>> Policies:
>>
>> - preemption=none: run to completion
>> - preemption=voluntary: run to completion, unless a task of higher
>>   sched-class awaits
>> - preemption=full: optimized for low-latency. Preempt whenever a higher
>>   priority task awaits.
>>
>> To do this, we add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
>> scheduler to mark that a reschedule is needed but deferred until the
>> task finishes executing in the kernel -- voluntary preemption, as it
>> were.
>>
>> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
>> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
>>
>>          ret-to-user    ret-to-kernel    preempt_count()
>> none           Y              N                N
>> voluntary      Y              Y                Y
>> full           Y              Y                Y
>
> Wait. The above is for when RESCHED_LAZY is to preempt, right?
>
> Then, shouldn't voluntary be:
>
>  voluntary      Y              N                N
>
> For LAZY, but
>
>  voluntary      Y              Y                Y
>
> For NEED_RESCHED (without lazy)

Yes. You are, of course, right. I started out talking about the
TIF_NEED_RESCHED flags and midway switched to describing how the voluntary
model gets to what it wants.

> That is, the only difference between voluntary and none (as you describe
> above) is that when an RT task wakes up, on voluntary, it sets NEED_RESCHED,
> but on none, it still sets NEED_RESCHED_LAZY?

Yeah exactly. Just to restate without mucking it up:

The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.

                  ret-to-user    ret-to-kernel    preempt_count()
NEED_RESCHED_LAZY    Y              N                N
NEED_RESCHED         Y              Y                Y

Based on how the various preemption models set these flags, they would
cause preemption at:

                  ret-to-user    ret-to-kernel    preempt_count()
none                 Y              N                N
voluntary            Y              Y                Y
full                 Y              Y                Y
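
To make that concrete, here's a rough standalone sketch of the idea -- just a
userspace model that reproduces the two tables above. The enum, helper, and
constant names below are made up for illustration; they are not the names the
series uses:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative model only; not kernel code. */
enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };
enum resched_flag  { RESCHED_LAZY, RESCHED_NOW };

/*
 * Which flag a wakeup (one that ought to preempt the current task) sets:
 *  - full: always NEED_RESCHED, i.e. preempt at the next opportunity
 *  - voluntary: NEED_RESCHED only if the woken task is in a higher sched
 *    class (RT/deadline); otherwise defer with NEED_RESCHED_LAZY
 *  - none: always defer with NEED_RESCHED_LAZY
 */
static enum resched_flag wakeup_flag(enum preempt_model m, bool higher_class)
{
	if (m == MODEL_FULL)
		return RESCHED_NOW;
	if (m == MODEL_VOLUNTARY && higher_class)
		return RESCHED_NOW;
	return RESCHED_LAZY;
}

/*
 * Where each flag is honoured: NEED_RESCHED at all three preemption points,
 * NEED_RESCHED_LAZY only at ret-to-user.
 */
static bool should_preempt(enum resched_flag f, bool ret_to_user)
{
	return f == RESCHED_NOW || ret_to_user;
}

int main(void)
{
	static const char *const model[] = { "none", "voluntary", "full" };
	static const char *const flag[]  = { "NEED_RESCHED_LAZY", "NEED_RESCHED" };
	int m;

	for (m = MODEL_NONE; m <= MODEL_FULL; m++) {
		enum resched_flag normal = wakeup_flag(m, false);
		enum resched_flag rt = wakeup_flag(m, true);

		printf("%-10s normal wakeup: %-17s (preempts in kernel: %c), "
		       "RT wakeup: %s\n",
		       model[m], flag[normal],
		       should_preempt(normal, false) ? 'Y' : 'N',
		       flag[rt]);
	}
	return 0;
}

So the only behavioural difference between none and voluntary is which flag an
RT/deadline wakeup sets, exactly as you describe above.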

>>   The max-load numbers (not posted here) also behave similarly.
>
> It would be interesting to run any "latency sensitive" benchmarks.
>
> I wonder how cyclictest would work under each model with and without this
> patch?

I didn't post these numbers because I suspect that code isn't quite right,
but voluntary preemption, for instance, does what it promises:

# echo NO_FORCE_PREEMPT  > sched/features
# echo NO_PREEMPT_PRIORITY > sched/features    # preempt=none
# stress-ng --cyclic 1  --timeout 10
stress-ng: info:  [1214172] setting to a 10 second run per stressor
stress-ng: info:  [1214172] dispatching hogs: 1 cyclic
stress-ng: info:  [1214174] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info:  [1214174] cyclic:   mean: 9834.56 ns, mode: 3495 ns
stress-ng: info:  [1214174] cyclic:   min: 2413 ns, max: 3145065 ns, std.dev. 77096.98
stress-ng: info:  [1214174] cyclic: latency percentiles:
stress-ng: info:  [1214174] cyclic:   25.00%:       3366 ns
stress-ng: info:  [1214174] cyclic:   50.00%:       3505 ns
stress-ng: info:  [1214174] cyclic:   75.00%:       3776 ns
stress-ng: info:  [1214174] cyclic:   90.00%:       4316 ns
stress-ng: info:  [1214174] cyclic:   95.40%:      10989 ns
stress-ng: info:  [1214174] cyclic:   99.00%:      91181 ns
stress-ng: info:  [1214174] cyclic:   99.50%:     290477 ns
stress-ng: info:  [1214174] cyclic:   99.90%:    1360837 ns
stress-ng: info:  [1214174] cyclic:   99.99%:    3145065 ns
stress-ng: info:  [1214172] successful run completed in 10.00s

# echo PREEMPT_PRIORITY > features    # preempt=voluntary
# stress-ng --cyclic 1  --timeout 10
stress-ng: info:  [916483] setting to a 10 second run per stressor
stress-ng: info:  [916483] dispatching hogs: 1 cyclic
stress-ng: info:  [916484] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info:  [916484] cyclic:   mean: 3682.77 ns, mode: 3185 ns
stress-ng: info:  [916484] cyclic:   min: 2523 ns, max: 150082 ns, std.dev. 2198.07
stress-ng: info:  [916484] cyclic: latency percentiles:
stress-ng: info:  [916484] cyclic:   25.00%:       3185 ns
stress-ng: info:  [916484] cyclic:   50.00%:       3306 ns
stress-ng: info:  [916484] cyclic:   75.00%:       3666 ns
stress-ng: info:  [916484] cyclic:   90.00%:       4778 ns
stress-ng: info:  [916484] cyclic:   95.40%:       5359 ns
stress-ng: info:  [916484] cyclic:   99.00%:       6141 ns
stress-ng: info:  [916484] cyclic:   99.50%:       7824 ns
stress-ng: info:  [916484] cyclic:   99.90%:      29825 ns
stress-ng: info:  [916484] cyclic:   99.99%:     150082 ns
stress-ng: info:  [916483] successful run completed in 10.01s

Both runs are with a background kernbench half-load. The difference shows up
mostly in the tail: p99 drops from ~91 us to ~6 us, and the max from ~3.1 ms
to ~150 us.

Let me see if I can dig out the numbers without this series.

--
ankur
