Message-ID: <366a4938-25eb-43a9-8858-64492c081a4f@suse.com>
Date: Wed, 8 Nov 2023 08:31:06 +0100
From: Juergen Gross <jgross@...e.com>
To: Ankur Arora <ankur.a.arora@...cle.com>,
linux-kernel@...r.kernel.org
Cc: tglx@...utronix.de, peterz@...radead.org,
torvalds@...ux-foundation.org, paulmck@...nel.org,
linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, willy@...radead.org, mgorman@...e.de,
jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
andrew.cooper3@...rix.com, mingo@...nel.org, bristot@...nel.org,
mathieu.desnoyers@...icios.com, geert@...ux-m68k.org,
glaubitz@...sik.fu-berlin.de, anton.ivanov@...bridgegreys.com,
mattst88@...il.com, krypton@...ich-teichert.org,
rostedt@...dmis.org, David.Laight@...LAB.COM, richard@....at,
mjguzik@...il.com
Subject: Re: [RFC PATCH 00/86] Make the kernel preemptible
On 07.11.23 22:56, Ankur Arora wrote:
> Hi,
>
> We have two models of preemption: voluntary and full (and RT, which is
> a stricter form of full preemption.) In this series -- which is based
> on Thomas' PoC (see [1]) -- we try to unify the two by letting the
> scheduler enforce policy for the voluntary preemption models as well.
>
> (Note that this is about preemption when executing in the kernel.
> Userspace is always preemptible.)
>
> Background
> ==
>
> Why? Because these two preemption mechanisms are almost entirely disjoint.
> There are four main sets of preemption points in the kernel:
>
> 1. return to user
> 2. explicit preemption points (cond_resched() and its ilk)
> 3. return to kernel (tick/IPI/irq at irqexit)
> 4. end of non-preemptible sections at (preempt_count() == preempt_offset)
>
> Voluntary preemption uses mechanisms 1 and 2. Full preemption
> uses 1, 3 and 4. In addition, both use cond_resched_{rcu,lock,rwlock*},
> which can be all things to all people because they internally
> contain 2 and 4.
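>
> (For illustration only -- a minimal, simplified sketch of what mechanisms
> 2 and 4 boil down to; the helper names are made up and the bodies elide
> the details of the real kernel implementations:)
>
>     /* Mechanism 2: an explicit preemption point (voluntary models). */
>     static inline void explicit_preemption_point(void)
>     {
>             if (need_resched())     /* TIF_NEED_RESCHED set? */
>                     schedule();
>     }
>
>     /* Mechanism 4: end of a non-preemptible section (full preemption).
>      * Once preempt_count() drops back to zero, a pending reschedule
>      * can be acted on immediately. */
>     static inline void end_of_non_preemptible_section(void)
>     {
>             preempt_count_dec();
>             if (!preempt_count() && need_resched())
>                     preempt_schedule();
>     }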
>
> Now, since there's no ideal placement of explicit preemption points,
> they tend to be randomly spread over the code and accumulate over time,
> as they are added when latency problems are seen. Plus, fear of
> regressions makes them difficult to remove.
> (Presumably, asymptotically they would spread out evenly across the
> instruction stream!)
>
> In voluntary models, the scheduler's job is to match the demand
> side of preemption points (a task that needs to be scheduled) with
> the supply side (a task which calls cond_resched().)
>
> Full preemption models track the preemption count so the scheduler
> always knows if it is safe to preempt and can drive preemption
> itself (ex. via dynamic preemption points in 3.)
>
> Design
> ==
>
> As Thomas outlines in [1], to unify the preemption models we want
> to always have the preempt_count enabled and allow the scheduler
> to drive preemption policy based on the model in effect.
>
> Policies:
>
> - preemption=none: run to completion
> - preemption=voluntary: run to completion, unless a task of higher
> sched-class awaits
> - preemption=full: optimized for low-latency. Preempt whenever a higher
> priority task awaits.
>
> To do this, add a new flag, TIF_NEED_RESCHED_LAZY, which allows the
> scheduler to mark that a reschedule is needed but is deferred until
> the task finishes executing in the kernel -- voluntary preemption,
> as it were.
>
> The TIF_NEED_RESCHED flag is evaluated at all three of the preemption
> points shown in the table below. TIF_NEED_RESCHED_LAZY only needs to be
> evaluated at ret-to-user.
>
>                  ret-to-user    ret-to-kernel    preempt_count()
>   none                Y               N                 N
>   voluntary           Y               Y                 Y
>   full                Y               Y                 Y
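>
> (A rough sketch of how those checks are meant to look under the design
> above; the exact helper names and call sites in the patches may differ:)
>
>     /* ret-to-user: all models act on both flags. */
>     if (test_thread_flag(TIF_NEED_RESCHED) ||
>         test_thread_flag(TIF_NEED_RESCHED_LAZY))
>             schedule();
>
>     /* ret-to-kernel and end of non-preemptible sections: only the
>      * eager flag forces preemption; the lazy flag waits for
>      * ret-to-user (or for the tick to upgrade it, see below). */
>     if (test_thread_flag(TIF_NEED_RESCHED))
>             preempt_schedule_irq();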
>
>
> There's just one remaining issue: now that explicit preemption points are
> gone, processes that spend a long time in the kernel have no way to give
> up the CPU.
>
> For full preemption, that is a non-issue as we always use TIF_NEED_RESCHED.
>
> For none/voluntary preemption, we handle that by upgrading to TIF_NEED_RESCHED
> if a task marked TIF_NEED_RESCHED_LAZY hasn't been scheduled out by the next tick.
> (This would cause preemption either at ret-to-kernel, or, if the task is in
> a non-preemptible section, when it exits that section.)
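>
> (Sketch of that tick-side upgrade; the helper name is hypothetical and
> the series folds the actual logic into the scheduler's tick/resched path:)
>
>     static void upgrade_lazy_resched(struct task_struct *curr)
>     {
>             /* Lazy reschedule still pending at the tick? Force it. */
>             if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
>                     set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
>     }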
>
> Arguably this provides for much more consistent maximum latency (~2 tick
> lengths + length of non-preemptible section) as compared to the old model
> where the maximum latency depended on the dynamic distribution of
> cond_resched() points.
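>
> (For a concrete feel for that bound, assuming HZ=250 and thus 4 ms ticks:
> the worst case works out to roughly 2 * 4 ms = 8 ms plus the longest
> non-preemptible section; with HZ=1000 it is about 2 ms plus that section.)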
>
> (As a bonus, it trivially handles code that is preemptible but cannot call
> cond_resched(): ex. long running Xen hypercalls, or this series
> which started this discussion:
> https://lore.kernel.org/all/20230830184958.2333078-8-ankur.a.arora@oracle.com/)
>
>
> Status
> ==
>
> What works:
> - The system seems to keep ticking over with the normal scheduling policies
> (SCHED_OTHER). (The support for the realtime policies is somewhat more
> half-baked.)
> - The basic performance numbers seem pretty close to the 6.6-rc7 baseline
>
> What's broken:
> - ARCH_NO_PREEMPT (See patch-45 "preempt: ARCH_NO_PREEMPT only preempts
> lazily")
> - Non-x86 architectures. It's trivial to support other archs (only need
> to add TIF_NEED_RESCHED_LAZY) but wanted to hold off until I got some
> comments on the series.
> (From some testing on arm64, didn't find any surprises.)
> - livepatch: livepatch depends on using _cond_resched() to provide
> low-latency patching. That is obviously difficult with cond_resched()
> gone. We could get a similar effect by using a static_key in
> preempt_enable(), but at least with inline locks, that might end
> up bloating the kernel quite a bit. (A rough sketch of that static_key
> idea follows this list.)
> - Documentation/ and comments mention cond_resched()
> - ftrace support for need-resched-lazy is incomplete
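>
> (For the livepatch point above -- a rough, hypothetical sketch of the
> static_key-in-preempt_enable() idea, not something this series implements;
> the key name is made up, klp_update_patch_state() is the existing
> livepatch hook:)
>
>     DEFINE_STATIC_KEY_FALSE(klp_transition_key);
>
>     static inline void preempt_enable_sketch(void)
>     {
>             preempt_count_dec();
>             if (static_branch_unlikely(&klp_transition_key) &&
>                 !preempt_count())
>                     klp_update_patch_state(current);
>             /* normal need_resched() handling elided */
>     }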
>
> What needs more discussion:
> - Should cond_resched_lock() etc. schedule out only for TIF_NEED_RESCHED,
> or for TIF_NEED_RESCHED_LAZY as well? (See patch 35 "thread_info:
> change to tif_need_resched(resched_t)")
> - Tracking whether a task is executing in userspace or in the kernel (See
> patch-40 "context_tracking: add ct_state_cpu()")
> - The right model for preempt=voluntary. (See patch 44 "sched: voluntary
> preemption")
>
>
> Performance
> ==
>
> Expectation:
>
> * perf sched bench pipe
>
>   preemption           full             none
>
>   6.6-rc7       6.68 +- 0.10     6.69 +- 0.07
>   +series       6.69 +- 0.12     6.67 +- 0.10
>
> This is rescheduling out of idle which should and does perform identically.
>
> * schbench, preempt=none
>
> * 1 group, 16 threads each
>
>              6.6-rc7    +series
>              (usecs)    (usecs)
>     50.0th:        6          6
>     90.0th:        8          7
>     99.0th:       11         11
>     99.9th:       15         14
>
> * 8 groups, 16 threads each
>
>              6.6-rc7    +series
>              (usecs)    (usecs)
>     50.0th:        6          6
>     90.0th:        8          8
>     99.0th:       12         11
>     99.9th:       20         21
>
>
> * schbench, preempt=full
>
> * 1 group, 16 threads each
>
>              6.6-rc7    +series
>              (usecs)    (usecs)
>     50.0th:        6          6
>     90.0th:        8          7
>     99.0th:       11         11
>     99.9th:       14         14
>
>
> * 8 groups, 16 threads each
>
>              6.6-rc7    +series
>              (usecs)    (usecs)
>     50.0th:        7          7
>     90.0th:        9          9
>     99.0th:       12         12
>     99.9th:       21         22
>
>
> Not much in it either way.
>
> * kernbench, preempt=full
>
> * half-load (-j 128)
>
>          6.6-rc7                      +series
>
>   wall     149.2 +-   27.2     wall     132.8 +-    0.4
>   utime   8097.1 +-   57.4     utime   8088.5 +-   14.1
>   stime   1165.5 +-    9.4     stime   1159.2 +-    1.9
>   %cpu    6337.6 +- 1072.8     %cpu    6959.6 +-   22.8
>   csw     237618 +- 2190.6     csw     240343 +- 1386.8
>
>
> * optimal-load (-j 1024)
>
>          6.6-rc7                          +series
>
>   wall     137.8 +-       0.0     wall     137.7 +-       0.8
>   utime  11115.0 +-    3306.1     utime  11041.7 +-    3235.0
>   stime   1340.0 +-     191.3     stime   1323.1 +-     179.5
>   %cpu    8846.3 +-    2830.6     %cpu    9101.3 +-    2346.7
>   csw    2099910 +- 2040080.0     csw    2068210 +- 2002450.0
>
>
> The preempt=full path should effectively not see any change in
> behaviour. The optimal-load numbers are pretty much identical.
> For the half-load, however, the +series version does much better, but that
> seems to be because of much higher run-to-run variability in the 6.6-rc7 load.
>
> * kernbench, preempt=none
>
> * half-load (-j 128)
>
>          6.6-rc7                    +series
>
>   wall     134.5 +-   4.2     wall     133.6 +-    2.7
>   utime   8093.3 +-  39.3     utime   8099.0 +-   38.9
>   stime   1175.7 +-  10.6     stime   1169.1 +-    8.4
>   %cpu    6893.3 +- 233.2     %cpu    6936.3 +-  142.8
>   csw     240723 +- 423.0     csw     173152 +- 1126.8
>
>
> * optimal-load (-j 1024)
>
>          6.6-rc7                          +series
>
>   wall     139.2 +-       0.3     wall     138.8 +-       0.2
>   utime  11161.0 +-    3360.4     utime  11061.2 +-    3244.9
>   stime   1357.6 +-     199.3     stime   1366.6 +-     216.3
>   %cpu    9108.8 +-    2431.4     %cpu    9081.0 +-    2351.1
>   csw    2078599 +- 2013320.0     csw    1970610 +- 1969030.0
>
>
> For both of these, the wallclock, utime, stime etc. are pretty much
> identical. The one interesting difference is that the number of
> context switches is lower. This intuitively makes sense given that
> we reschedule threads lazily rather than rescheduling whenever we encounter
> a cond_resched() and there's a thread wanting to be scheduled.
>
> The max-load numbers (not posted here) also behave similarly.
>
>
> Series
> ==
>
> With that, this is how the series is laid out:
>
> - Patches 01-30: revert the PREEMPT_DYNAMIC code. Most of the infrastructure
> it uses is built on static_calls(), and this is a simpler approach which
> doesn't need any of that (and does away with cond_resched().)
>
> Some of the commits will be resurrected.
> 089c02ae2771 ("ftrace: Use preemption model accessors for trace header printout")
> cfe43f478b79 ("preempt/dynamic: Introduce preemption model accessors")
> 5693fa74f98a ("kcsan: Use preemption model accessors")
>
> - Patches 31-45: contain the scheduler changes to do this. Of these
> the critical ones are:
> patch 35 "thread_info: change to tif_need_resched(resched_t)"
> patch 41 "sched: handle resched policy in resched_curr()"
> patch 43 "sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models"
> patch 44 "sched: voluntary preemption"
> (this needs more work to decide when a higher sched-policy task
> should preempt a lower sched-policy task)
> patch 45 "preempt: ARCH_NO_PREEMPT only preempts lazily"
>
> - Patches 47-50: contain RCU related changes. RCU now works in both
> PREEMPT_RCU=y and PREEMPT_RCU=n modes with CONFIG_PREEMPTION.
> (Until now PREEMPTION=y => PREEMPT_RCU)
>
> - Patches 51-56,86: contain cond_resched() related cleanups.
> patch 54 "sched: add cond_resched_stall()" adds a new cond_resched()
> interface. Pitchforks?
>
> - Patches 57-86: remove cond_resched() from the tree.
>
>
> Also at: github.com/terminus/linux preemption-rfc
>
>
> Please review.
>
> Thanks
> Ankur
>
> [1] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>
>
> Ankur Arora (86):
> Revert "riscv: support PREEMPT_DYNAMIC with static keys"
> Revert "sched/core: Make sched_dynamic_mutex static"
> Revert "ftrace: Use preemption model accessors for trace header
> printout"
> Revert "preempt/dynamic: Introduce preemption model accessors"
> Revert "kcsan: Use preemption model accessors"
> Revert "entry: Fix compile error in
> dynamic_irqentry_exit_cond_resched()"
> Revert "livepatch,sched: Add livepatch task switching to
> cond_resched()"
> Revert "arm64: Support PREEMPT_DYNAMIC"
> Revert "sched/preempt: Add PREEMPT_DYNAMIC using static keys"
> Revert "sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from
> GENERIC_ENTRY"
> Revert "sched/preempt: Simplify irqentry_exit_cond_resched() callers"
> Revert "sched/preempt: Refactor sched_dynamic_update()"
> Revert "sched/preempt: Move PREEMPT_DYNAMIC logic later"
> Revert "preempt/dynamic: Fix setup_preempt_mode() return value"
> Revert "preempt: Restore preemption model selection configs"
> Revert "sched: Provide Kconfig support for default dynamic preempt
> mode"
> sched/preempt: remove PREEMPT_DYNAMIC from the build version
> Revert "preempt/dynamic: Fix typo in macro conditional statement"
> Revert "sched,preempt: Move preempt_dynamic to debug.c"
> Revert "static_call: Relax static_call_update() function argument
> type"
> Revert "sched/core: Use -EINVAL in sched_dynamic_mode()"
> Revert "sched/core: Stop using magic values in sched_dynamic_mode()"
> Revert "sched,x86: Allow !PREEMPT_DYNAMIC"
> Revert "sched: Harden PREEMPT_DYNAMIC"
> Revert "sched: Add /debug/sched_preempt"
> Revert "preempt/dynamic: Support dynamic preempt with preempt= boot
> option"
> Revert "preempt/dynamic: Provide irqentry_exit_cond_resched() static
> call"
> Revert "preempt/dynamic: Provide preempt_schedule[_notrace]() static
> calls"
> Revert "preempt/dynamic: Provide cond_resched() and might_resched()
> static calls"
> Revert "preempt: Introduce CONFIG_PREEMPT_DYNAMIC"
> x86/thread_info: add TIF_NEED_RESCHED_LAZY
> entry: handle TIF_NEED_RESCHED_LAZY
> entry/kvm: handle TIF_NEED_RESCHED_LAZY
> thread_info: accessors for TIF_NEED_RESCHED*
> thread_info: change to tif_need_resched(resched_t)
> entry: irqentry_exit only preempts TIF_NEED_RESCHED
> sched: make test_*_tsk_thread_flag() return bool
> sched: *_tsk_need_resched() now takes resched_t
> sched: handle lazy resched in set_nr_*_polling()
> context_tracking: add ct_state_cpu()
> sched: handle resched policy in resched_curr()
> sched: force preemption on tick expiration
> sched: enable PREEMPT_COUNT, PREEMPTION for all preemption models
> sched: voluntary preemption
> preempt: ARCH_NO_PREEMPT only preempts lazily
> tracing: handle lazy resched
> rcu: select PREEMPT_RCU if PREEMPT
> rcu: handle quiescent states for PREEMPT_RCU=n
> osnoise: handle quiescent states directly
> rcu: TASKS_RCU does not need to depend on PREEMPTION
> preempt: disallow !PREEMPT_COUNT or !PREEMPTION
> sched: remove CONFIG_PREEMPTION from *_needbreak()
> sched: fixup __cond_resched_*()
> sched: add cond_resched_stall()
> xarray: add cond_resched_xas_rcu() and cond_resched_xas_lock_irq()
> xarray: use cond_resched_xas*()
> coccinelle: script to remove cond_resched()
> treewide: x86: remove cond_resched()
> treewide: rcu: remove cond_resched()
> treewide: torture: remove cond_resched()
> treewide: bpf: remove cond_resched()
> treewide: trace: remove cond_resched()
> treewide: futex: remove cond_resched()
> treewide: printk: remove cond_resched()
> treewide: task_work: remove cond_resched()
> treewide: kernel: remove cond_resched()
> treewide: kernel: remove cond_reshed()
> treewide: mm: remove cond_resched()
> treewide: io_uring: remove cond_resched()
> treewide: ipc: remove cond_resched()
> treewide: lib: remove cond_resched()
> treewide: crypto: remove cond_resched()
> treewide: security: remove cond_resched()
> treewide: fs: remove cond_resched()
> treewide: virt: remove cond_resched()
> treewide: block: remove cond_resched()
> treewide: netfilter: remove cond_resched()
> treewide: net: remove cond_resched()
> treewide: net: remove cond_resched()
> treewide: sound: remove cond_resched()
> treewide: md: remove cond_resched()
> treewide: mtd: remove cond_resched()
> treewide: drm: remove cond_resched()
> treewide: net: remove cond_resched()
> treewide: drivers: remove cond_resched()
> sched: remove cond_resched()
I'm missing the removal of the Xen parts, which were one of the reasons to start
this whole work (xen_in_preemptible_hcall etc.).
Juergen