Message-ID: <20241116160126.29454-1-changwoo@igalia.com>
Date: Sun, 17 Nov 2024 01:01:21 +0900
From: Changwoo Min <multics69@...il.com>
To: tj@...nel.org,
	void@...ifault.com
Cc: mingo@...hat.com,
	peterz@...radead.org,
	changwoo@...lia.com,
	kernel-dev@...lia.com,
	linux-kernel@...r.kernel.org
Subject: [PATCH 0/5] sched_ext: Support high-performance monotonically non-decreasing clock

Many BPF schedulers (such as scx_lavd, scx_rusty, and scx_bpfland)
frequently call bpf_ktime_get_ns() to track tasks' runtime properties.
Where supported, bpf_ktime_get_ns() eventually reads a hardware
timestamp counter (TSC). However, reading the hardware TSC is not
performant on some hardware platforms, degrading IPC.

This patchset addresses the performance problem of reading the hardware
TSC by leveraging the rq clock in the scheduler core and introducing a
scx_bpf_clock_get_ns() function for BPF schedulers. Whenever the rq clock
is fresh enough, scx_bpf_clock_get_ns() returns the rq clock, which has
already been updated by the scheduler core (update_rq_clock), so it
reduces the number of TSC reads.
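
For illustration, a BPF scheduler could use the new kfunc roughly as in
the sketch below; the callback, helper, and field names are made up for
this example, and only scx_bpf_clock_get_ns() comes from this series:

    void BPF_STRUCT_OPS(example_running, struct task_struct *p)
    {
        /*
         * Hypothetical per-task context lookup; the clock read below
         * replaces what would otherwise be a bpf_ktime_get_ns() call.
         */
        struct task_ctx *taskc = lookup_task_ctx(p);

        if (taskc)
            taskc->running_at_clk = scx_bpf_clock_get_ns();
    }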

When the rq lock is released (rq_unpin_lock) or a long-running
operation is performed by the BPF scheduler (ops.running, ops.update_idle),
the rq clock is invalidated, so a subsequent scx_bpf_clock_get_ns() call
returns a fresh sched_clock reading to the caller.
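
Conceptually, the kfunc behaves along the lines of the following sketch;
the rq-side field and flag names (scx.clock, scx.flags, SCX_RQ_CLK_VALID)
are illustrative placeholders, not necessarily what the patches use:

    u64 scx_bpf_clock_get_ns(void)
    {
        struct rq *rq = this_rq();

        /* Reuse the clock cached by update_rq_clock() while it is valid. */
        if (rq->scx.flags & SCX_RQ_CLK_VALID)
            return rq->scx.clock;   /* no hardware TSC read */

        /*
         * Invalidated (rq_unpin_lock, ops.running, ops.update_idle, ...):
         * fall back to a fresh sched_clock reading.
         */
        return sched_clock_cpu(cpu_of(rq));
    }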

In addition, scx_bpf_clock_get_ns() guarantees that the clock is
monotonically non-decreasing per CPU, so the clock cannot go
backward on the same CPU.
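
One simple way to provide that guarantee is to clamp each returned value
against the last value handed out on the same CPU, roughly as below
(prev_clock is an assumed per-rq field used only for this sketch):

    static u64 clamp_monotonic(struct rq *rq, u64 clock)
    {
        /* Never let the returned clock go backward on this CPU. */
        if (clock < rq->scx.prev_clock)
            clock = rq->scx.prev_clock;
        rq->scx.prev_clock = clock;
        return clock;
    }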

Using scx_bpf_clock_get_ns() reduces the number of hardware TSC reads
by 40-70% (65% for scx_lavd, 58% for scx_bpfland, and 43% for scx_rusty)
for the following benchmark:

    perf bench -f simple sched messaging -t -g 20 -l 6000

The patchset first manages the validity of the rq clock in the scheduler
core, then implements scx_bpf_clock_get_ns(), and finally applies it to
the BPF schedulers.

Changwoo Min (5):
  sched_ext: Implement scx_rq_clock_update/stale()
  sched_ext: Manage the validity of scx_rq_clock
  sched_ext: Implement scx_bpf_clock_get_ns()
  sched_ext: Add scx_bpf_clock_get_ns() for BPF scheduler
  sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_clock_get_ns()

 kernel/sched/core.c                      |  6 +-
 kernel/sched/ext.c                       | 74 ++++++++++++++++++++++++
 kernel/sched/sched.h                     | 22 ++++++-
 tools/sched_ext/include/scx/common.bpf.h |  1 +
 tools/sched_ext/include/scx/compat.bpf.h |  5 ++
 tools/sched_ext/scx_central.bpf.c        |  4 +-
 tools/sched_ext/scx_flatcg.bpf.c         |  2 +-
 7 files changed, 109 insertions(+), 5 deletions(-)

-- 
2.47.0

