linux-kernel - [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

Message-Id: <20240213055554.1802415-1-ankur.a.arora@oracle.com>
Date: Mon, 12 Feb 2024 21:55:24 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org
Cc: tglx@...utronix.de, peterz@...radead.org, torvalds@...ux-foundation.org,
        paulmck@...nel.org, akpm@...ux-foundation.org, luto@...nel.org,
        bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
        mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, jpoimboe@...nel.org,
        mark.rutland@....com, jgross@...e.com, andrew.cooper3@...rix.com,
        bristot@...nel.org, mathieu.desnoyers@...icios.com,
        geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
        anton.ivanov@...bridgegreys.com, mattst88@...il.com,
        krypton@...ich-teichert.org, rostedt@...dmis.org,
        David.Laight@...LAB.COM, richard@....at, mjguzik@...il.com,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        Ankur Arora <ankur.a.arora@...cle.com>
Subject: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

Hi,

This series adds a new scheduling model, PREEMPT_AUTO, which, like
PREEMPT_DYNAMIC, allows dynamic switching between the none/voluntary/full
preemption models. However, unlike PREEMPT_DYNAMIC, it doesn't depend
on explicit preemption points for the voluntary models.

The series is based on Thomas' original proposal which he outlined
in [1], [2] and in his PoC [3].

An earlier RFC version is at [4].

Design
==

PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)

Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to state that a reschedule is
needed.
PREEMPT_AUTO adds a second need-resched bit (TIF_NEED_RESCHED_LAZY).
Together, the two bits allow the scheduler to express two kinds of
rescheduling intent: schedule at the earliest opportunity
(TIF_NEED_RESCHED), or express a need for rescheduling while allowing
the running task to run to timeslice completion
(TIF_NEED_RESCHED_LAZY).

As mentioned above, the scheduler decides which need-resched bits are
chosen based on the preemption model in use:

	       TIF_NEED_RESCHED        TIF_NEED_RESCHED_LAZY

none		never   		always [*]
voluntary       higher sched class	other tasks [*]
full 		always                  never

[*] some details elided here.
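
As a rough illustration of the table above, here is a small userspace
sketch. It is not code from the series: the enum and function names are
made up for this example, and the elided details (marked [*]) are
ignored.

```c
#include <stdbool.h>

/* Hypothetical names, modelled on the table above; not the series' code. */
enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };
enum resched_kind  { RESCHED_LAZY, RESCHED_NOW };

/*
 * Pick which need-resched bit to set when the scheduler wants to
 * displace the running task. 'higher_class' is true when the incoming
 * task belongs to a higher scheduling class than the running one.
 */
static enum resched_kind pick_resched(enum preempt_model model,
				      bool higher_class)
{
	switch (model) {
	case MODEL_FULL:
		/* full: always TIF_NEED_RESCHED */
		return RESCHED_NOW;
	case MODEL_VOLUNTARY:
		/* voluntary: eager only for a higher sched class */
		return higher_class ? RESCHED_NOW : RESCHED_LAZY;
	case MODEL_NONE:
	default:
		/* none: always TIF_NEED_RESCHED_LAZY */
		return RESCHED_LAZY;
	}
}
```

With this model, pick_resched(MODEL_VOLUNTARY, true) yields RESCHED_NOW,
while the same call under MODEL_NONE yields RESCHED_LAZY.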

The last part of the puzzle is, when does preemption happen, or
alternately stated, when are the need-resched bits checked:

                 exit-to-user    ret-to-kernel    preempt_count()

NEED_RESCHED_LAZY     Y               N                N
NEED_RESCHED          Y               Y                Y

Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
none/voluntary preemption policies are in effect, while NEED_RESCHED
gives eager semantics under full preemption.
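
The check-point table can be read as a single predicate. The sketch
below is a userspace model under that reading. The flag values and the
names are assumptions made for illustration, not the kernel's
definitions.

```c
#include <stdbool.h>

/* Illustrative flag values; the real TIF_* bits are arch-specific. */
#define NEED_RESCHED		(1u << 0)
#define NEED_RESCHED_LAZY	(1u << 1)

/* The three columns of the table above (hypothetical enum). */
enum resched_point {
	POINT_EXIT_TO_USER,
	POINT_RET_TO_KERNEL,
	POINT_PREEMPT_COUNT,
};

/* Returns true if a reschedule would be acted on at this point. */
static bool resched_checked_at(enum resched_point point, unsigned int flags)
{
	/* NEED_RESCHED is honoured at every point. */
	if (flags & NEED_RESCHED)
		return true;

	/* NEED_RESCHED_LAZY is only honoured on exit to user mode. */
	return point == POINT_EXIT_TO_USER &&
	       (flags & NEED_RESCHED_LAZY) != 0;
}
```

So a task with only the lazy bit set keeps running across returns to
kernel context and is switched out at the next exit to user mode.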

In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases. For example, a
kernel task not relinquishing the CPU under NEED_RESCHED_LAZY can be
handled by simply upgrading to a full NEED_RESCHED, which can use more
coercive instruments like a resched IPI to induce a context-switch.
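
A minimal sketch of the upgrade idea follows. It assumes the upgrade
happens on a scheduler tick that finds the lazy bit still set. The
struct, the flag values and the function name are hypothetical and only
model the state transition, not the series' actual tick handling.

```c
/* Illustrative flag values; the real TIF_* bits are arch-specific. */
#define NEED_RESCHED		(1u << 0)
#define NEED_RESCHED_LAZY	(1u << 1)

/* Hypothetical stand-in for the running task's thread flags. */
struct task_model {
	unsigned int flags;
};

/*
 * Model of a tick: if the running task was already asked to reschedule
 * lazily and is still on the CPU, escalate to an eager reschedule.
 * The lazy bit is left set here; clearing it would be an equally valid
 * choice for this model.
 */
static void tick_upgrade(struct task_model *t)
{
	if ((t->flags & NEED_RESCHED_LAZY) && !(t->flags & NEED_RESCHED))
		t->flags |= NEED_RESCHED;
}
```

A task with no pending request is left untouched by tick_upgrade(), so
only tasks that have ignored a lazy request are escalated.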

Performance
==
The performance in the basic tests (perf bench sched messaging,
kernbench) is fairly close to what we see under PREEMPT_DYNAMIC.
(See patches 24, 25.)

Comparing stress-ng --cyclic latencies with a background kernel load
(stress-ng --mmap) serves as a good demonstration of how letting the
scheduler enforce priorities, tick exhaustion, etc. helps:

 PREEMPT_DYNAMIC, preempt=voluntary
   stress-ng: info:  [12252] setting to a 300 second (5 mins, 0.00 secs) run per stressor
   stress-ng: info:  [12252] dispatching hogs: 1 cyclic
   stress-ng: info:  [12253] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
   stress-ng: info:  [12253] cyclic:   mean: 19973.46 ns, mode: 3560 ns
   stress-ng: info:  [12253] cyclic:   min: 2541 ns, max: 2751830 ns, std.dev. 68891.71
   stress-ng: info:  [12253] cyclic: latency percentiles:
   stress-ng: info:  [12253] cyclic:   25.00%:       4800 ns
   stress-ng: info:  [12253] cyclic:   50.00%:      12458 ns
   stress-ng: info:  [12253] cyclic:   75.00%:      25220 ns
   stress-ng: info:  [12253] cyclic:   90.00%:      35404 ns


 PREEMPT_AUTO, preempt=voluntary
   stress-ng: info:  [8883] setting to a 300 second (5 mins, 0.00 secs) run per stressor
   stress-ng: info:  [8883] dispatching hogs: 1 cyclic
   stress-ng: info:  [8884] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
   stress-ng: info:  [8884] cyclic:   mean: 14169.08 ns, mode: 3355 ns
   stress-ng: info:  [8884] cyclic:   min: 2570 ns, max: 2234939 ns, std.dev. 66056.95
   stress-ng: info:  [8884] cyclic: latency percentiles:
   stress-ng: info:  [8884] cyclic:   25.00%:       3665 ns
   stress-ng: info:  [8884] cyclic:   50.00%:       5409 ns
   stress-ng: info:  [8884] cyclic:   75.00%:      16009 ns
   stress-ng: info:  [8884] cyclic:   90.00%:      24392 ns

Notice how much lower the 25/50/75/90 percentile latencies are for the
PREEMPT_AUTO case: roughly 24%, 57%, 37% and 31% lower, respectively.
(See patch 26 for the full performance numbers.)


For a macro test, a colleague in Oracle's Exadata team tried two
OLTP benchmarks (on a 5.4.17 based Oracle kernel, with this series
backported.)

In both tests the data was cached on remote nodes (cells), and the
database nodes (compute) served client queries, with clients being
local in the first test and remote in the second.

Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs


                                  PREEMPT_VOLUNTARY                    PREEMPT_AUTO
                                                                    (preempt=voluntary)
                            ==============================      ==============================
                   clients  throughput      cpu-usage           throughput      cpu-usage          Gain
                             (tx/min)   (utime %/stime %)        (tx/min)   (utime %/stime %)
                   -------  ----------  -----------------       ----------  -----------------   -------

  OLTP                 384   9,315,653       25/ 6               9,253,252       25/ 6            -0.7%
  benchmark           1536  13,177,565       50/10              13,657,306       50/10            +3.6%
 (local clients)      3456  14,063,017       63/12              14,179,706       64/12            +0.8%


  OLTP                  96   8,973,985       17/ 2               8,924,926       17/ 2            -0.5%
  benchmark            384  22,577,254       60/ 8              22,211,419       59/ 8            -1.6%
 (remote clients,     2304  25,882,857       82/11              25,536,100       82/11            -1.3%
  90/10 RW ratio)


(Both sets of tests have a fair amount of network traffic since the
query tables etc. are cached on the cells. Additionally, the first set,
given the local clients, stresses the scheduler a bit more than the
second.)

The comparative performance for both the tests is fairly close,
more or less within a margin of error.

IMO the tests above (sched-messaging, kernbench, stress-ng, OLTP) show
that this scheduling model has legs. That said, the none/voluntary
models under PREEMPT_AUTO are conceptually different enough that there
likely are workloads where performance would be subpar. That needs
more extensive testing to figure out the weak points.


Series layout
==

Patch 1,
  "preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.

Patches 2-5,
  "thread_info: selector for TIF_NEED_RESCHED[_LAZY]",
  "thread_info: tif_need_resched() now takes resched_t as param",
  "sched: make test_*_tsk_thread_flag() return bool",
  "sched: *_tsk_need_resched() now takes resched_t as param"

introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.

Patches 6-9,
  "entry: handle lazy rescheduling at user-exit",
  "entry/kvm: handle lazy rescheduling at guest-entry",
  "entry: irqentry_exit only preempts for TIF_NEED_RESCHED",
  "sched: __schedule_loop() doesn't need to check for need_resched_lazy()"

make changes/document the rescheduling points.

Patches 10-11,
  "sched: separate PREEMPT_DYNAMIC config logic",
  "sched: runtime preemption config under PREEMPT_AUTO"

reuse the PREEMPT_DYNAMIC runtime configuration logic.

Patches 12-16,
  "rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO",
  "rcu: fix header guard for rcu_all_qs()",
  "preempt,rcu: warn on PREEMPT_RCU=n, preempt_model_full",
  "rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y",
  "rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"

add RCU support.

Patch 17,
  "x86/thread_info: define TIF_NEED_RESCHED_LAZY"

adds x86 support. 

Note on platform support: this is x86 only for now. However, supporting
architectures with !ARCH_NO_PREEMPT is straightforward, especially
if they support GENERIC_ENTRY.

Patches 18-21,
  "sched: prepare for lazy rescheduling in resched_curr()",
  "sched: default preemption policy for PREEMPT_AUTO",
  "sched: handle idle preemption for PREEMPT_AUTO",
  "sched: schedule eagerly in resched_cpu()"

are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.

Patches 22-23,
  "sched/fair: refactor update_curr(), entity_tick()",
  "sched/fair: handle tick expiry under lazy preemption"

handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.

And finally, patches 24-26,
  "sched: support preempt=none under PREEMPT_AUTO",
  "sched: support preempt=full under PREEMPT_AUTO",
  "sched: handle preempt=voluntary under PREEMPT_AUTO"

add support for the three preemption models.

Patches 27-30,
  "sched: latency warn for TIF_NEED_RESCHED_LAZY",
  "tracing: support lazy resched",
  "Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
  "osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"

handle the remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.


Changelog
==

Changes since the RFC:
 - Addresses review comments and is generally a more focused
   version of the RFC.
 - Lots of code reorganization.
 - Bugfixes all over.
 - need_resched() now only checks for TIF_NEED_RESCHED instead
   of TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
 - set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
 - Tighten idle related checks.
 - RCU changes to force context-switches when a quiescent state is
   urgently needed.
 - Does not break live-patching anymore.

Also at: github.com/terminus/linux preempt-v1

Please review.

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/


Ankur Arora (30):
  preempt: introduce CONFIG_PREEMPT_AUTO
  thread_info: selector for TIF_NEED_RESCHED[_LAZY]
  thread_info: tif_need_resched() now takes resched_t as param
  sched: make test_*_tsk_thread_flag() return bool
  sched: *_tsk_need_resched() now takes resched_t as param
  entry: handle lazy rescheduling at user-exit
  entry/kvm: handle lazy rescheduling at guest-entry
  entry: irqentry_exit only preempts for TIF_NEED_RESCHED
  sched: __schedule_loop() doesn't need to check for need_resched_lazy()
  sched: separate PREEMPT_DYNAMIC config logic
  sched: runtime preemption config under PREEMPT_AUTO
  rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
  rcu: fix header guard for rcu_all_qs()
  preempt,rcu: warn on PREEMPT_RCU=n, preempt_model_full
  rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
  rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
  x86/thread_info: define TIF_NEED_RESCHED_LAZY
  sched: prepare for lazy rescheduling in resched_curr()
  sched: default preemption policy for PREEMPT_AUTO
  sched: handle idle preemption for PREEMPT_AUTO
  sched: schedule eagerly in resched_cpu()
  sched/fair: refactor update_curr(), entity_tick()
  sched/fair: handle tick expiry under lazy preemption
  sched: support preempt=none under PREEMPT_AUTO
  sched: support preempt=full under PREEMPT_AUTO
  sched: handle preempt=voluntary under PREEMPT_AUTO
  sched: latency warn for TIF_NEED_RESCHED_LAZY
  tracing: support lazy resched
  Documentation: tracing: add TIF_NEED_RESCHED_LAZY
  osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

 .../admin-guide/kernel-parameters.txt         |   1 +
 Documentation/trace/ftrace.rst                |   6 +-
 arch/s390/include/asm/preempt.h               |   4 +-
 arch/s390/mm/pfault.c                         |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/thread_info.h            |  10 +-
 drivers/acpi/processor_idle.c                 |   2 +-
 include/asm-generic/preempt.h                 |   4 +-
 include/linux/entry-common.h                  |   2 +-
 include/linux/entry-kvm.h                     |   2 +-
 include/linux/preempt.h                       |   2 +-
 include/linux/rcutree.h                       |   2 +-
 include/linux/sched.h                         |  43 ++-
 include/linux/sched/idle.h                    |   8 +-
 include/linux/thread_info.h                   |  57 +++-
 include/linux/trace_events.h                  |   6 +-
 init/Makefile                                 |   1 +
 kernel/Kconfig.preempt                        |  37 ++-
 kernel/entry/common.c                         |  12 +-
 kernel/entry/kvm.c                            |   4 +-
 kernel/rcu/Kconfig                            |   2 +-
 kernel/rcu/tiny.c                             |   2 +-
 kernel/rcu/tree.c                             |  17 +-
 kernel/rcu/tree_exp.h                         |   4 +-
 kernel/rcu/tree_plugin.h                      |  15 +-
 kernel/rcu/tree_stall.h                       |   2 +-
 kernel/sched/core.c                           | 311 ++++++++++++------
 kernel/sched/deadline.c                       |   6 +-
 kernel/sched/debug.c                          |  13 +-
 kernel/sched/fair.c                           |  56 ++--
 kernel/sched/idle.c                           |   4 +-
 kernel/sched/rt.c                             |   6 +-
 kernel/sched/sched.h                          |  27 +-
 kernel/trace/trace.c                          |   4 +-
 kernel/trace/trace_osnoise.c                  |  22 +-
 kernel/trace/trace_output.c                   |  16 +-
 36 files changed, 498 insertions(+), 215 deletions(-)

-- 
2.31.1