Date:   Wed, 8 Nov 2023 08:31:06 +0100
From:   Juergen Gross <jgross@...e.com>
To:     Ankur Arora <ankur.a.arora@...cle.com>,
        linux-kernel@...r.kernel.org
Cc:     tglx@...utronix.de, peterz@...radead.org,
        torvalds@...ux-foundation.org, paulmck@...nel.org,
        linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
        luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, willy@...radead.org, mgorman@...e.de,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        andrew.cooper3@...rix.com, mingo@...nel.org, bristot@...nel.org,
        mathieu.desnoyers@...icios.com, geert@...ux-m68k.org,
        glaubitz@...sik.fu-berlin.de, anton.ivanov@...bridgegreys.com,
        mattst88@...il.com, krypton@...ich-teichert.org,
        rostedt@...dmis.org, David.Laight@...LAB.COM, richard@....at,
        mjguzik@...il.com
Subject: Re: [RFC PATCH 00/86] Make the kernel preemptible

On 07.11.23 22:56, Ankur Arora wrote:
> Hi,
> 
> We have two models of preemption: voluntary and full (and RT, which is
> a fuller form of full preemption.) In this series -- which is based
> on Thomas' PoC (see [1]) -- we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.
> 
> (Note that this is about preemption when executing in the kernel.
> Userspace is always preemptible.)
> 
> Background
> ==
> 
> Why? Both of these preemption mechanisms are almost entirely disjoint.
> There are four main sets of preemption points in the kernel:
> 
>   1. return to user
>   2. explicit preemption points (cond_resched() and its ilk)
>   3. return to kernel (tick/IPI/irq at irqexit)
>   4. end of non-preemptible sections at (preempt_count() == preempt_offset)
> 
> Voluntary preemption uses mechanisms 1 and 2. Full preemption
> uses 1, 3 and 4. In addition both use cond_resched_{rcu,lock,rwlock*}
> which can be all things to all people because they internally
> contain 2 and 4.
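> 
> (For reference, a simplified sketch of what points 2 and 4 look like
>  today; close to, but not exactly, the code in kernel/sched/core.c and
>  include/linux/preempt.h:)
> 
>     /* Point 2: an explicit, voluntary preemption point. */
>     int __sched __cond_resched(void)
>     {
>             if (should_resched(0)) {
>                     preempt_schedule_common();
>                     return 1;
>             }
>             return 0;
>     }
> 
>     /* Point 4: end of a non-preemptible section (CONFIG_PREEMPTION). */
>     #define preempt_enable()                                        \
>     do {                                                            \
>             barrier();                                              \
>             if (unlikely(preempt_count_dec_and_test()))             \
>                     __preempt_schedule();                           \
>     } while (0)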
> 
> Now since there's no ideal placement of explicit preemption points,
> they tend to be randomly spread over code and accumulate over time,
> as they are added when latency problems are seen. Plus, fear of
> regressions makes them difficult to remove.
> (Presumably, asymptotically they would spread out evenly across the
> instruction stream!)
> 
> In voluntary models, the scheduler's job is to match the demand
> side of preemption points (a task that needs to be scheduled) with
> the supply side (a task which calls cond_resched().)
> 
> Full preemption models track the preemption count so the scheduler
> always knows if it is safe to preempt and can drive preemption
> itself (ex. via dynamic preemption points in 3.)
> 
> Design
> ==
> 
> As Thomas outlines in [1], to unify the preemption models we
> want to: always have the preempt_count enabled and allow the scheduler
> to drive preemption policy based on the model in effect.
> 
> Policies:
> 
> - preemption=none: run to completion
> - preemption=voluntary: run to completion, unless a task of higher
>    sched-class awaits
> - preemption=full: optimized for low-latency. Preempt whenever a higher
>    priority task awaits.
> 
> To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
> scheduler to mark that a reschedule is needed but deferred until
> the task finishes executing in the kernel -- voluntary preemption,
> as it were.
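> 
> (A purely illustrative sketch of how the policies above might drive the
>  choice of flag in resched_curr(); preempt_model, PREEMPT_MODEL_*,
>  higher_class_waiting() and __resched_curr() are made-up names, not
>  necessarily what the patches use. See patch 41 "sched: handle resched
>  policy in resched_curr()":)
> 
>     void resched_curr(struct rq *rq)
>     {
>             bool lazy;
> 
>             switch (preempt_model) {        /* assumed: boot-time policy */
>             case PREEMPT_MODEL_NONE:
>                     lazy = true;            /* defer to exit-to-user */
>                     break;
>             case PREEMPT_MODEL_VOLUNTARY:
>                     /* eager only for a waiting higher sched-class task */
>                     lazy = !higher_class_waiting(rq);
>                     break;
>             case PREEMPT_MODEL_FULL:
>             default:
>                     lazy = false;           /* preempt at the next opportunity */
>                     break;
>             }
> 
>             /* sets TIF_NEED_RESCHED or TIF_NEED_RESCHED_LAZY on rq->curr */
>             __resched_curr(rq, lazy);
>     }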
> 
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points. TIF_NEED_RESCHED_LAZY only needs to be evaluated at ret-to-user.
> 
>           ret-to-user    ret-to-kernel    preempt_count()
> none           Y              N                N
> voluntary      Y              Y                Y
> full           Y              Y                Y
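> 
> (A sketch of the matrix above, simplified from the generic entry code;
>  the _TIF_NEED_RESCHED_LAZY bit in the exit-to-user check is the part
>  this series adds, the rest is roughly what exists today:)
> 
>     /* exit to user: either flag leads to a reschedule */
>     if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>             schedule();
> 
>     /*
>      * return to kernel (irqentry_exit_cond_resched()): only the eager
>      * flag preempts; a lazily-marked task keeps running until it
>      * returns to userspace or the flag gets upgraded (see below).
>      */
>     if (!preempt_count() && need_resched())
>             preempt_schedule_irq();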
> 
> 
> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spend a long time in the kernel have no way to give
> up the CPU.
> 
> For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.
> 
> For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
> if a task marked TIF_NEED_RESCHED_LAZY hasn't preempted away by the next tick.
> (This would cause preemption either at ret-to-kernel, or if the task is in
> a non-preemptible section, when it exits that section.)
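> 
> (Illustrative only; the helper name is made up, the real change is in
>  patch 42 "sched: force preemption on tick expiration":)
> 
>     /*
>      * From the scheduler tick: the task was marked for lazy rescheduling
>      * but is still running in the kernel a tick later, so escalate to
>      * TIF_NEED_RESCHED and let ret-to-kernel / preempt_enable() take it
>      * from there.
>      */
>     static void resched_tick_upgrade(struct task_struct *curr)
>     {
>             if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
>                     set_tsk_need_resched(curr);
>     }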
> 
> Arguably this provides for much more consistent maximum latency (~2 tick
> lengths + length of non-preemptible section) as compared to the old model
> where the maximum latency depended on the dynamic distribution of
> cond_resched() points.
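> (For example, with HZ=250 a tick is 4 ms, so the bound would be on the
>  order of ~8 ms plus the longest non-preemptible section, independent of
>  where cond_resched() calls happened to be placed.)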
> 
> (As a bonus it handles code that is preemptible but cannot call cond_resched()
>   completely trivially: ex. long running Xen hypercalls, or this series
>   which started this discussion:
>   https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)
> 
> 
> Status
> ==
> 
> What works:
>   - The system seems to keep ticking over with the normal scheduling policies
>     (SCHED_OTHER). The support for the realtime policies is somewhat more
>     half-baked.
>   - The basic performance numbers seem pretty close to 6.6-rc7 baseline
> 
> What's broken:
>   - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
>     lazily")
>   - Non-x86 architectures. It's trivial to support other archs (only need
>     to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
>     comments on the series.
>     (From some testing on arm64, didn't find any surprises.)
>   - livepatch: livepatch depends on using _cond_resched() to provide
>     low-latency patching. That is obviously difficult with cond_resched()
>     gone. We could get a similar effect by using a static_key in
>     preempt_enable() (a rough sketch follows this list) but, at least
>     with inline locks, that might end up bloating the kernel quite a bit.
>   - Documentation/ and comments mention cond_resched()
>   - ftrace support for need-resched-lazy is incomplete
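> 
> (A rough sketch of the static_key idea from the livepatch item above;
>  klp_resched_key and klp_cond_resched_hook() are made-up names and the
>  placement is a guess; the point is the extra inline code that every
>  preempt_enable() site would grow:)
> 
>     DEFINE_STATIC_KEY_FALSE(klp_resched_key);
> 
>     #define preempt_enable()                                        \
>     do {                                                            \
>             barrier();                                              \
>             if (static_branch_unlikely(&klp_resched_key))           \
>                     klp_cond_resched_hook(); /* assumed klp helper */\
>             if (unlikely(preempt_count_dec_and_test()))             \
>                     __preempt_schedule();                           \
>     } while (0)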
> 
> What needs more discussion:
>   - Should cond_resched_lock() etc schedule out only for TIF_NEED_RESCHED,
>     or for TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
>     change to tif_need_resched(resched_t)")
>   - Tracking whether a task is in userspace or in the kernel (See patch-40
>     "context_tracking: add ct_state_cpu()")
>   - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
>     preemption")
> 
> 
> Performance
> ==
> 
> Expectation:
> 
> * perf sched bench pipe
> 
> preemption               full           none
> 
> 6.6-rc7              6.68 +- 0.10   6.69 +- 0.07
> +series              6.69 +- 0.12   6.67 +- 0.10
> 
> This is rescheduling out of idle which should and does perform identically.
> 
> * schbench, preempt=none
> 
>    * 1 group, 16 threads each
> 
>                   6.6-rc7      +series
>                   (usecs)      (usecs)
>       50.0th:         6            6
>       90.0th:         8            7
>       99.0th:        11           11
>       99.9th:        15           14
>    
>    * 8 groups, 16 threads each
> 
>                  6.6-rc7       +series
>                   (usecs)      (usecs)
>       50.0th:         6            6
>       90.0th:         8            8
>       99.0th:        12           11
>       99.9th:        20           21
> 
> 
> * schbench, preempt=full
> 
>    * 1 group, 16 threads each
> 
>                  6.6-rc7       +series
>                  (usecs)       (usecs)
>       50.0th:         6            6
>       90.0th:         8            7
>       99.0th:        11           11
>       99.9th:        14           14
> 
> 
>    * 8 groups, 16 threads each
> 
>                  6.6-rc7       +series
>                  (usecs)       (usecs)
>       50.0th:         7            7
>       90.0th:         9            9
>       99.0th:        12           12
>       99.9th:        21           22
> 
> 
>    Not much in it either way.
> 
> * kernbench, preempt=full
> 
>    * half-load (-j 128)
> 
>             6.6-rc7                                    +series
> 
>    wall        149.2  +-     27.2             wall        132.8  +-     0.4
>    utime      8097.1  +-     57.4             utime      8088.5  +-    14.1
>    stime      1165.5  +-      9.4             stime      1159.2  +-     1.9
>    %cpu       6337.6  +-   1072.8             %cpu       6959.6  +-    22.8
>    csw      237618    +-   2190.6             csw      240343    +-  1386.8
> 
> 
>    * optimal-load (-j 1024)
> 
>             6.6-rc7                                    +series
> 
>    wall        137.8 +-       0.0             wall       137.7  +-       0.8
>    utime     11115.0 +-    3306.1             utime    11041.7  +-    3235.0
>    stime      1340.0 +-     191.3             stime     1323.1  +-     179.5
>    %cpu       8846.3 +-    2830.6             %cpu      9101.3  +-    2346.7
>    csw     2099910   +- 2040080.0             csw    2068210    +- 2002450.0
> 
> 
>    The preempt=full path should effectively not see any change in
>    behaviour. The optimal-loads are pretty much identical.
>    For the half-load, however, the +series version does much better but that
>    seems to be because of much higher run-to-run variability in the 6.6-rc7 runs.
> 
> * kernbench, preempt=none
> 
>    * half-load (-j 128)
> 
>             6.6-rc7                                    +series
> 
>    wall        134.5  +-      4.2             wall        133.6  +-     2.7
>    utime      8093.3  +-     39.3             utime      8099.0  +-    38.9
>    stime      1175.7  +-     10.6             stime      1169.1  +-     8.4
>    %cpu       6893.3  +-    233.2             %cpu       6936.3  +-   142.8
>    csw      240723    +-    423.0             csw      173152    +-  1126.8
>                                               
> 
>    * optimal-load (-j 1024)
> 
>             6.6-rc7                                    +series
> 
>    wall        139.2 +-       0.3             wall       138.8  +-       0.2
>    utime     11161.0 +-    3360.4             utime    11061.2  +-    3244.9
>    stime      1357.6 +-     199.3             stime     1366.6  +-     216.3
>    %cpu       9108.8 +-    2431.4             %cpu      9081.0  +-    2351.1
>    csw     2078599   +- 2013320.0             csw    1970610    +- 1969030.0
> 
> 
>    For both of these the wallclock, utime, stime etc. are pretty much
>    identical. The one interesting difference is that the number of
>    context switches is lower. This intuitively makes sense given that
>    we reschedule threads lazily rather than rescheduling whenever we
>    encounter a cond_resched() and there's a thread wanting to be scheduled.
> 
>    The max-load numbers (not posted here) also behave similarly.
> 
> 
> Series
> ==
> 
> With that, this is how the series is laid out:
> 
>   - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of its
>     infrastructure is built around static_calls(), and this is a simpler
>     approach which doesn't need any of that (and does away with cond_resched().)
> 
>     Some of the commits will be resurrected.
>         089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
>         cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
>         5693fa74f98a ("kcsan: Use preemption model accessors")
> 
>   - Patches 31-45: contain the scheduler changes to do this. Of these
>     the critical ones are:
>       patch 35 "thread_info: change to tif_need_resched(resched_t)"
>       patch 41 "sched: handle resched policy in resched_curr()"
>       patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
>       patch 44 "sched: voluntary preemption"
>        (this needs more work to decide when a higher sched-policy task
>         should preempt a lower sched-policy task)
>       patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"
> 
>   - Patches 47-50: contain RCU related changes. RCU now works in both
>     PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
>     (Until now PREEMPTION=y => PREEMPT_RCU)
> 
>   - Patches 51-56,86: contain cond_resched() related cleanups.
>       patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
>       interface. Pitchforks?
> 
>   - Patches 57-86: remove cond_resched() from the tree.
> 
> 
> Also at: github.com/terminus/linux preemption-rfc
> 
> 
> Please review.
> 
> Thanks
> Ankur
> 
> [1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> 
> 
> Ankur Arora (86):
>    Revert "riscv: support PREEMPT_DYNAMIC with static keys"
>    Revert "sched/core: Make sched_dynamic_mutex static"
>    Revert "ftrace: Use preemption model accessors for trace header
>      printout"
>    Revert "preempt/dynamic: Introduce preemption model accessors"
>    Revert "kcsan: Use preemption model accessors"
>    Revert "entry: Fix compile error in
>      dynamic_irqentry_exit_cond_resched()"
>    Revert "livepatch,sched: Add livepatch task switching to
>      cond_resched()"
>    Revert "arm64: Support PREEMPT_DYNAMIC"
>    Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
>    Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
>      GENERIC_ENTRY"
>    Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
>    Revert "sched/preempt: Refactor sched_dynamic_update()"
>    Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
>    Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
>    Revert "preempt: Restore preemption model selection configs"
>    Revert "sched: Provide Kconfig support for default dynamic preempt
>      mode"
>    sched/preempt: remove PREEMPT_DYNAMIC from the build version
>    Revert "preempt/dynamic: Fix typo in macro conditional statement"
>    Revert "sched,preempt: Move preempt_dynamic to debug.c"
>    Revert "static_call: Relax static_call_update() function argument
>      type"
>    Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
>    Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
>    Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
>    Revert "sched: Harden PREEMPT_DYNAMIC"
>    Revert "sched: Add /debug/sched_preempt"
>    Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
>      option"
>    Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
>      call"
>    Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
>      calls"
>    Revert "preempt/dynamic: Provide cond_resched() and might_resched()
>      static calls"
>    Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
>    x86/thread_info: add TIF_NEED_RESCHED_LAZY
>    entry: handle TIF_NEED_RESCHED_LAZY
>    entry/kvm: handle TIF_NEED_RESCHED_LAZY
>    thread_info: accessors for TIF_NEED_RESCHED*
>    thread_info: change to tif_need_resched(resched_t)
>    entry: irqentry_exit only preempts TIF_NEED_RESCHED
>    sched: make test_*_tsk_thread_flag() return bool
>    sched: *_tsk_need_resched() now takes resched_t
>    sched: handle lazy resched in set_nr_*_polling()
>    context_tracking: add ct_state_cpu()
>    sched: handle resched policy in resched_curr()
>    sched: force preemption on tick expiration
>    sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
>    sched: voluntary preemption
>    preempt: ARCH_NO_PREEMPT only preempts lazily
>    tracing: handle lazy resched
>    rcu: select PREEMPT_RCU if PREEMPT
>    rcu: handle quiescent states for PREEMPT_RCU=n
>    osnoise: handle quiescent states directly
>    rcu: TASKS_RCU does not need to depend on PREEMPTION
>    preempt: disallow !PREEMPT_COUNT or !PREEMPTION
>    sched: remove CONFIG_PREEMPTION from *_needbreak()
>    sched: fixup __cond_resched_*()
>    sched: add cond_resched_stall()
>    xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
>    xarray: use cond_resched_xas*()
>    coccinelle: script to remove cond_resched()
>    treewide: x86: remove cond_resched()
>    treewide: rcu: remove cond_resched()
>    treewide: torture: remove cond_resched()
>    treewide: bpf: remove cond_resched()
>    treewide: trace: remove cond_resched()
>    treewide: futex: remove cond_resched()
>    treewide: printk: remove cond_resched()
>    treewide: task_work: remove cond_resched()
>    treewide: kernel: remove cond_resched()
>    treewide: kernel: remove cond_resched()
>    treewide: mm: remove cond_resched()
>    treewide: io_uring: remove cond_resched()
>    treewide: ipc: remove cond_resched()
>    treewide: lib: remove cond_resched()
>    treewide: crypto: remove cond_resched()
>    treewide: security: remove cond_resched()
>    treewide: fs: remove cond_resched()
>    treewide: virt: remove cond_resched()
>    treewide: block: remove cond_resched()
>    treewide: netfilter: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: sound: remove cond_resched()
>    treewide: md: remove cond_resched()
>    treewide: mtd: remove cond_resched()
>    treewide: drm: remove cond_resched()
>    treewide: net: remove cond_resched()
>    treewide: drivers: remove cond_resched()
>    sched: remove cond_resched()

I'm missing the removal of the Xen parts, which were one of the reasons to start
this whole work (xen_in_preemptible_hcall etc.).


Juergen
