[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231027175216.1424cd83@gandalf.local.home>
Date: Fri, 27 Oct 2023 17:52:16 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Ankur Arora <ankur.a.arora@...cle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, willy@...radead.org, mgorman@...e.de,
jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
jgross@...e.com, andrew.cooper3@...rix.com,
Joel Fernandes <joel@...lfernandes.org>,
Youssef Esmat <youssefesmat@...omium.org>,
Vineeth Pillai <vineethrp@...gle.com>,
Suleiman Souhlal <suleiman@...gle.com>,
Ingo Molnar <mingo@...nel.org>,
Daniel Bristot de Oliveira <bristot@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
PostgreSQL Developers <pgsql-hackers@...ts.postgresql.org>,
Robert Haas <robertmhaas@...il.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice
As this change showed a really good improvement in my micro-benchmark, I
decided to give it a try on something that's a bit more of a real world
benchmark: PostgreSQL.
Hence, I'm also Cc'ing the postgres maintainers in case they are interested.
To bring them up to speed, here's what's been going on:
A new feature is being discussed to add NEED_RESCHED_LAZY that will allow us
(the kernel developers) to remove CONFIG_PREEMPT_NONE and
CONFIG_PREEMPT_VOLUNTARY, and move that to user space runtime switches. The
idea is that when the scheduler tick goes off and it's time to schedule out a
SCHED_OTHER task, user space can have the option of not doing that in the
kernel and wait till it enters back into user space before scheduling
(simulating PREEMPT_NONE). The scheduler tick will set NEED_RESCHED_LAZY
(instead of NEED_RESCHED which always schedules when possible), and that bit
will only be looked at when exiting back into user space, where it will
perform the schedule if it is set.
My idea is to extend this into user space as well. Using the restartable
sequences infrastructure (https://lwn.net/Articles/697979/) that maps memory
between kernel and user space for threads, I'll use two bits (or one bit and a
counter, but that's for later, I'll just discuss the current implementation).
bit 0: is set by user space to tell the kernel that it's in a critical
section.
bit 1: is set by the kernel telling user space that it granted it a bit more
time and that it should call back into the kernel with any system call
(sched_yield() or gettid()), when it is out of its critical section.
Bit 1 will never be set if bit 0 is not set (Note, there's talk about making
bit 0 the one set by the kernel, or use a different word entirely to allow
the rest of the bits to be used as a counter for nested critical sections).
Now when returning back to user space, if the critical section bit (or
counter) is set, then it will not call schedule when NEED_RESCHED_LAZY is set.
Note that it will still always call schedule on NEED_RESCHED. This gives user
space one more tick (1 ms with 1000 HZ kernel config, to 4 ms with 250 HZ
kernel config). When user space is done with its critical section, it should
check the bit that can be set by the kernel to see if it should then schedule.
If the user side bit is not cleared after a tick, then the kernel will set
NEED_RESCHED which will force a schedule no matter where user space happens to
be. Note, this could also hurt that task in that the scheduler will take away
the eligibility of that task to balance out the amount of extra time the task
ran for, not to mention, the force schedule could now land in a critical
section.
Back in 2014 at the Linux Collaboration Summit in Napa Valley I had a nice
conversation with Robet Haas about user space spin locks. He told me how they
are used in PostgreSQL where futexes did not meet their needs. This
conversation kicked off the idea about implementing user space adaptive spin
locks (which last year in Belgium, I asked André Almeida to implement -
https://lwn.net/Articles/931789/).
Even though user space adaptive spinning would greatly help out the contention
of a lock, there's still the issue of a lock owner being preempted which would
cause all those waiters to also go into the kernel and delay access to the
critical sections. In the real time kernel, this was solved by
NEED_RESCHED_LAZY:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/preempt-lazy-support.patch?h=v5.9-rc2-rt1-patches
Now Thomas has proposed using a similar solution to solve the PREEMPT_NONE /
VOLUNTARY issue.
https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
Which now has a prototype in the rt-devel tree:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches
For which I applied to v6.6-rc4 (my current working branch), and applied the
solution I explained above (POC with debugging still in it):
https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/
Now I downloaded the latest postgres from:
https://github.com/postgres/postgres.git
And built sha: 26f988212eada9c586223cbbf876c7eb455044d9
After installing it, I rebooted the machine with the updated kernel (requires
CONFIG_PREEMPT_AUTO being set), and ran the unmodified version of postgres
pgbench test:
pgbench -c 100 -T 300 -j 8 -S -n
I ran it 16 times and looked at the transactions per second counter (tps).
Note, I threw out the first run as it had horrible numbers probably due to
everything in cold cache (memory and file system).
Then I applied the below patch, did a make clean, make install, rebooted the
box again and ran the test for another 16 times (again, the first run was
horrible).
Here are the results of the tests: I only used the 15 runs after the first run
for comparisons.
Without the patched postgres executable:
First run:
tps = 72573.188203 (without initial connection time)
15 runs:
tps = 74315.731978 (without initial connection time)
tps = 74448.130108 (without initial connection time)
tps = 74662.246100 (without initial connection time)
tps = 73124.961311 (without initial connection time)
tps = 74653.611878 (without initial connection time)
tps = 74765.296134 (without initial connection time)
tps = 74497.066104 (without initial connection time)
tps = 74541.664031 (without initial connection time)
tps = 74595.032066 (without initial connection time)
tps = 74545.876793 (without initial connection time)
tps = 74762.560651 (without initial connection time)
tps = 74528.657018 (without initial connection time)
tps = 74814.700753 (without initial connection time)
tps = 74687.980967 (without initial connection time)
tps = 74973.185122 (without initial connection time)
With the patched postgres executable:
First run:
tps = 73562.005970 (without initial connection time)
15 runs:
tps = 74560.101322 (without initial connection time)
tps = 74711.177071 (without initial connection time)
tps = 74551.093281 (without initial connection time)
tps = 74559.452628 (without initial connection time)
tps = 74737.604361 (without initial connection time)
tps = 74521.606019 (without initial connection time)
tps = 74870.859166 (without initial connection time)
tps = 74545.423471 (without initial connection time)
tps = 74805.939815 (without initial connection time)
tps = 74665.240730 (without initial connection time)
tps = 74701.479550 (without initial connection time)
tps = 74897.154079 (without initial connection time)
tps = 74879.687067 (without initial connection time)
tps = 74792.563116 (without initial connection time)
tps = 74852.101317 (without initial connection time)
Without the patch:
Average: 74527.7800
Std Dev: 420.6304
With the patch:
Average: 74710.0988
Std Dev: 136.7250
Notes about my setup. I ran this on one of my older test boxes (pretty much the
last of the bare metal machines I test on, as now I do most on VMs, but did not
want to run these tests on VMs).
It's a 4 core (2 hyper threaded) total of 8 CPUs:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
32G of RAM.
Postgres has several types of locks. I first applied the extend()/unextend() to
only the raw spin locks but that didn't make much difference. When I traced
it, I found that the time slice seldom landed when one of these spin locks
were held. 10 or so during a 5 minute run (I added writes to the kernel
tracing buffer via trace_marker to know when the locks were held, and also
recorded sched_switch to see if it was ever preempted). So I expanded the
usage to the "light weight locks" (lwlocks), which are similar to the spin
locks but does some complex backing off. Basically a heuristic spin. Anyway,
after adding the logic to these locks, it definitely made a difference. Not a
huge one, but it was noticeable beyond the noise. I can imagine that if this
was implemented on a machine with many more CPUs than 8, it would make an even
bigger difference.
I also had to rerun my tests because I left some kernel config options enabled
that affected performance. I didn't want that to skew the results. But the
results were similar, except that with the slower kernel, the worse
performance with the patch was better than the best performance without it.
Not by much, but still. After removing the slowdown, that was no longer the
case. But I noticed that with the patch, the standard deviation was much
smaller than without the patch. I'm guessing that without the patch it depends
on how the scheduler interacts with the locking much more and a good run
without the patch was just the test being "lucky" that it didn't get preempted
as much in a critical section.
I personally think the smaller standard deviation is a win as it makes the
database run with a more deterministic behavior.
Anyway, I'd really like to know what others think about this, and perhaps they
can run this on their own testing infrastructure. All the code is available to
reproduce. If you want to reproduce it like I did. Checkout the latest Linus
tree (or do what I have which is v6.6-rc4), apply the above mentioned
PREEMPT_AUTO.patch, then apply my kernel patch. Select CONFIG_PREEMPT_AUTO and
build the kernel. Don't worry about that big banner that is on the kernel
console at boot up telling you that you are running a debug kernel. You are
running one, and it's because I still have a couple of trace_printk()s in it.
Run your postgres tests with and without the patch and let me know if there's a
difference. Only if you think it's worth it. Let me know if you don't ;-)
-- Steve
[ The biggest part of the below change is adding in the standard rseq_abi.h header ]
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 315a78cda9..652d3a5560 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -89,11 +89,12 @@
#include "storage/spin.h"
#include "utils/memutils.h"
+#include "storage/rseq-abi.h"
+
#ifdef LWLOCK_STATS
#include "utils/hsearch.h"
#endif
-
/* We use the ShmemLock spinlock to protect LWLockCounter */
extern slock_t *ShmemLock;
@@ -841,6 +842,8 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
desired_state += LW_VAL_SHARED;
}
+ if (lock_free)
+ extend();
/*
* Attempt to swap in the state we are expecting. If we didn't see
* lock to be free, that's just the old value. If we saw it as free,
@@ -863,9 +866,14 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
#endif
return false;
}
- else
+ else {
+ if (lock_free)
+ unextend();
return true; /* somebody else has the lock */
+ }
}
+ if (lock_free)
+ unextend();
}
pg_unreachable();
}
@@ -1868,6 +1876,7 @@ LWLockRelease(LWLock *lock)
LWLockWakeup(lock);
}
+ unextend();
/*
* Now okay to allow cancel/die interrupts.
*/
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c..c22310cfe3 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -55,6 +55,8 @@
#include "storage/s_lock.h"
#include "utils/wait_event.h"
+#include "storage/rseq-abi.h"
+
#define MIN_SPINS_PER_DELAY 10
#define MAX_SPINS_PER_DELAY 1000
#define NUM_DELAYS 1000
@@ -66,7 +68,6 @@ slock_t dummy_spinlock;
static int spins_per_delay = DEFAULT_SPINS_PER_DELAY;
-
/*
* s_lock_stuck() - complain about a stuck spinlock
*/
@@ -94,6 +95,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
{
SpinDelayStatus delayStatus;
+ unextend();
+
init_spin_delay(&delayStatus, file, line, func);
while (TAS_SPIN(lock))
@@ -102,6 +105,7 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
}
finish_spin_delay(&delayStatus);
+ extend();
return delayStatus.delays;
}
diff --git a/src/include/storage/rseq-abi.h b/src/include/storage/rseq-abi.h
new file mode 100644
index 0000000000..b858cf1d6f
--- /dev/null
+++ b/src/include/storage/rseq-abi.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _RSEQ_ABI_H
+#define _RSEQ_ABI_H
+
+/*
+ * rseq-abi.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2022 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
+ */
+
+#include <ctype.h>
+#include <asm/types.h>
+
+enum rseq_abi_cpu_id_state {
+ RSEQ_ABI_CPU_ID_UNINITIALIZED = -1,
+ RSEQ_ABI_CPU_ID_REGISTRATION_FAILED = -2,
+};
+
+enum rseq_abi_flags {
+ RSEQ_ABI_FLAG_UNREGISTER = (1 << 0),
+};
+
+enum rseq_abi_cs_flags_bit {
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+};
+
+enum rseq_abi_cs_flags {
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT =
+ (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT),
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL =
+ (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
+ RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE =
+ (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+};
+
+/*
+ * struct rseq_abi_cs is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line. It is usually declared as
+ * link-time constant data.
+ */
+struct rseq_abi_cs {
+ /* Version of this structure. */
+ __u32 version;
+ /* enum rseq_abi_cs_flags */
+ __u32 flags;
+ __u64 start_ip;
+ /* Offset from start_ip. */
+ __u64 post_commit_offset;
+ __u64 abort_ip;
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+/*
+ * struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line.
+ *
+ * A single struct rseq_abi per thread is allowed.
+ */
+struct rseq_abi {
+ /*
+ * Restartable sequences cpu_id_start field. Updated by the
+ * kernel. Read by user-space with single-copy atomicity
+ * semantics. This field should only be read by the thread which
+ * registered this data structure. Aligned on 32-bit. Always
+ * contains a value in the range of possible CPUs, although the
+ * value may not be the actual current CPU (e.g. if rseq is not
+ * initialized). This CPU number value should always be compared
+ * against the value of the cpu_id field before performing a rseq
+ * commit or returning a value read from a data structure indexed
+ * using the cpu_id_start value.
+ */
+ __u32 cpu_id_start;
+ /*
+ * Restartable sequences cpu_id field. Updated by the kernel.
+ * Read by user-space with single-copy atomicity semantics. This
+ * field should only be read by the thread which registered this
+ * data structure. Aligned on 32-bit. Values
+ * RSEQ_CPU_ID_UNINITIALIZED and RSEQ_CPU_ID_REGISTRATION_FAILED
+ * have a special semantic: the former means "rseq uninitialized",
+ * and latter means "rseq initialization failed". This value is
+ * meant to be read within rseq critical sections and compared
+ * with the cpu_id_start value previously read, before performing
+ * the commit instruction, or read and compared with the
+ * cpu_id_start value before returning a value loaded from a data
+ * structure indexed using the cpu_id_start value.
+ */
+ __u32 cpu_id;
+ /*
+ * Restartable sequences rseq_cs field.
+ *
+ * Contains NULL when no critical section is active for the current
+ * thread, or holds a pointer to the currently active struct rseq_cs.
+ *
+ * Updated by user-space, which sets the address of the currently
+ * active rseq_cs at the beginning of assembly instruction sequence
+ * block, and set to NULL by the kernel when it restarts an assembly
+ * instruction sequence block, as well as when the kernel detects that
+ * it is preempting or delivering a signal outside of the range
+ * targeted by the rseq_cs. Also needs to be set to NULL by user-space
+ * before reclaiming memory that contains the targeted struct rseq_cs.
+ *
+ * Read and set by the kernel. Set by user-space with single-copy
+ * atomicity semantics. This field should only be updated by the
+ * thread which registered this data structure. Aligned on 64-bit.
+ */
+ union {
+ __u64 ptr64;
+
+ /*
+ * The "arch" field provides architecture accessor for
+ * the ptr field based on architecture pointer size and
+ * endianness.
+ */
+ struct {
+#ifdef __LP64__
+ __u64 ptr;
+#elif defined(__BYTE_ORDER) ? (__BYTE_ORDER == __BIG_ENDIAN) : defined(__BIG_ENDIAN)
+ __u32 padding; /* Initialized to zero. */
+ __u32 ptr;
+#else
+ __u32 ptr;
+ __u32 padding; /* Initialized to zero. */
+#endif
+ } arch;
+ } rseq_cs;
+
+ /*
+ * Restartable sequences flags field.
+ *
+ * This field should only be updated by the thread which
+ * registered this data structure. Read by the kernel.
+ * Mainly used for single-stepping through rseq critical sections
+ * with debuggers.
+ *
+ * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * Inhibit instruction sequence block restart on preemption
+ * for this thread.
+ * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * Inhibit instruction sequence block restart on signal
+ * delivery for this thread.
+ * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE
+ * Inhibit instruction sequence block restart on migration for
+ * this thread.
+ */
+ __u32 flags;
+
+ /*
+ * Restartable sequences node_id field. Updated by the kernel. Read by
+ * user-space with single-copy atomicity semantics. This field should
+ * only be read by the thread which registered this data structure.
+ * Aligned on 32-bit. Contains the current NUMA node ID.
+ */
+ __u32 node_id;
+
+ /*
+ * Restartable sequences mm_cid field. Updated by the kernel. Read by
+ * user-space with single-copy atomicity semantics. This field should
+ * only be read by the thread which registered this data structure.
+ * Aligned on 32-bit. Contains the current thread's concurrency ID
+ * (allocated uniquely within a memory map).
+ */
+ __u32 mm_cid;
+
+ __u32 cr_flags;
+ /*
+ * Flexible array member at end of structure, after last feature field.
+ */
+ char end[];
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+#endif /* _RSEQ_ABI_H */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c9fa84cc43..999a056c6f 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -96,6 +96,10 @@
#ifndef S_LOCK_H
#define S_LOCK_H
+#define __GNU_SOURCE
+#include <unistd.h>
+#include "storage/rseq-abi.h"
+
#ifdef FRONTEND
#error "s_lock.h may not be included from frontend code"
#endif
@@ -131,6 +135,53 @@
*----------
*/
+extern ptrdiff_t __rseq_offset;
+extern unsigned int __rseq_size;
+
+static inline unsigned clear_extend(volatile unsigned *ptr)
+{
+ unsigned ret;
+
+ asm volatile("andb %b1,%0"
+ : "+m" (*(volatile char *)ptr)
+ : "iq" (~0x1)
+ : "memory");
+
+ ret = *ptr;
+ *ptr = 0;
+
+ return ret;
+}
+
+static inline void extend(void)
+{
+ struct rseq_abi *rseq_ptr;
+
+ if (!__rseq_size)
+ return;
+
+ rseq_ptr = (void *)((unsigned long)__builtin_thread_pointer() + __rseq_offset);
+ rseq_ptr->cr_flags = 1;
+}
+
+static inline void unextend(void)
+{
+ struct rseq_abi *rseq_ptr;
+ unsigned prev;
+
+ if (!__rseq_size)
+ return;
+
+ rseq_ptr = (void *)((unsigned long)__builtin_thread_pointer() + __rseq_offset);
+
+ prev = clear_extend(&rseq_ptr->cr_flags);
+ if (prev & 2) {
+ gettid();
+ }
+}
+
+#define __S_UNLOCK(lock) do { S_UNLOCK(lock); unextend(); } while (0)
+#define __S_LOCK(lock) do { extend(); S_LOCK(lock); } while (0)
#ifdef __i386__ /* 32-bit i386 */
#define HAS_TEST_AND_SET
diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index 5d809cc980..4230025748 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -59,9 +59,9 @@
#define SpinLockInit(lock) S_INIT_LOCK(lock)
-#define SpinLockAcquire(lock) S_LOCK(lock)
+#define SpinLockAcquire(lock) __S_LOCK(lock)
-#define SpinLockRelease(lock) S_UNLOCK(lock)
+#define SpinLockRelease(lock) __S_UNLOCK(lock)
#define SpinLockFree(lock) S_LOCK_FREE(lock)
Powered by blists - more mailing lists