Date: Mon, 22 Jan 2024 14:59:14 -0800
From: Benjamin Segall <bsegall@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, Boqun
 Feng <boqun.feng@...il.com>
Cc: linux-kernel@...r.kernel.org
Subject: [RFC PATCH] locking/percpu-rwsem: do not do lock handoff in
 percpu_up_write

The waitq wakeup in percpu_up_write necessarily runs the wake function
immediately in the current thread. With it calling
__percpu_rwsem_trylock on behalf of the thread being woken, the lock is
extremely fair and FIFO, with the window for unfairness merely being the
time between the release of sem->block and the completion of a relevant
trylock.

However, the woken threads that now hold the lock may not be chosen to
run for a long time, and it would be useful to have more of this window
available for a currently running thread to unfairly take the lock
immediately and use it. The handoff can result in priority-inversion
issues under high contention or with things like CFS_BANDWIDTH quotas.

The even older version of percpu_rwsem that used an rwsem at its core
provided some related gains in a different fashion through
RWSEM_SPIN_ON_OWNER; while it had a similarly small window, waiting
writers could spin, making it far more likely that a thread would hit
this window.

Signed-off-by: Ben Segall <bsegall@...gle.com>

---

So the actual problem we saw was that one job had severe slowdowns
during startup with certain other jobs on the machine, and the slowdowns
turned out to be some cgroup moves it did during startup. The antagonist
jobs were spawning huge numbers of threads and some other internal bugs
were exacerbating their contention. The lock handoff meant that a batch
of antagonist threads would receive the read lock of
cgroup_threadgroup_rwsem and at least some of those threads would take a
long time to be scheduled.

As a totally artificial test, you can see occasional latencies of up
to multiple seconds when moving single-threaded processes via
cgroup.procs, with an antagonist cpu cgroup of thousands of cpu-soakers
and tasks doing (pthread_create; pthread_join) or fork/join. A simple
antagonist is just a cpu+cpuset cgroup and:

bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do :; done&"; done' &
bash -c 'enter cgroup; for i in {1..4000}; do bash -c "while true; do /bin/true; done&"; done' &
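For reference, a fleshed-out version of the one-liners above might look like the sketch below. The cgroup path and the spawn_soakers helper are hypothetical ("enter cgroup" is modeled by each child writing its own PID into cgroup.procs); the fork/join variant just swaps the busy loop for /bin/true.

```shell
#!/bin/sh
# Hypothetical expansion of the antagonist one-liners; the cgroup-v2
# path below is an assumption, not the setup actually used.
CG=${CG:-/sys/fs/cgroup/antagonist}

# Spawn $1 busy-loop children that move themselves into $CG;
# their PIDs are collected in $PIDS for later cleanup.
spawn_soakers() {
    n=$1
    PIDS=""
    i=0
    while [ "$i" -lt "$n" ]; do
        sh -c "echo \$\$ 2>/dev/null > \"$CG/cgroup.procs\"
               while :; do :; done" &
        PIDS="$PIDS $!"
        i=$((i + 1))
    done
}

# To reproduce the scale used above:
# mkdir -p "$CG" && spawn_soakers 4000
```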

favordynmods improves the average cost of the migrate, but doesn't help
much with the spikes. This patch does.

The locktorture results under the current locktorture are mixed (10m
runtime, 72-cpu machine, default settings):

With this patch:
Writes:  Total: 56094  Max/Min: 922/680   Fail: 0 
Reads :  Total: 21522  Max/Min: 305/292   Fail: 0

Baseline:
Writes:  Total: 33801  Max/Min: 472/468   Fail: 0
Reads :  Total: 33649  Max/Min: 475/463   Fail: 0
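For reference, the numbers above came from the in-tree locktorture module; a sketch of an equivalent invocation is below. Only torture_type and the 10-minute runtime come from this message; stat_interval and the dry-run guard are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of a locktorture run matching the setup above: the
# percpu_rwsem_lock type with default reader/writer counts, stopped
# after 10 minutes so the Writes:/Reads: summary lands in dmesg.
# stat_interval=60 is an assumed choice. Dry-run unless RUN=1.
ARGS="torture_type=percpu_rwsem_lock stat_interval=60"

echo "modprobe locktorture $ARGS"
if [ "${RUN:-0}" = 1 ] && modinfo locktorture >/dev/null 2>&1; then
    modprobe locktorture $ARGS
    sleep 600                   # "10m runtime"
    rmmod locktorture           # final stats are printed at unload
fi
```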

Under the old locktorture I originally tested against (or with it
patched back in), where torture_rwsem_write_delay basically did
"mdelay(long_hold / 10)" when it wasn't doing the long hold, the results
were more clearly beneficial (there is noticeable variance in this
config, so I have stats and an example run near the median):

Baseline:
Writes:  Total: 11852  Max/Min: 167/164   Fail: 0
Reads :  Total: 11858  Max/Min: 169/164   Fail: 0
    N           Min           Max        Median           Avg        Stddev
R  50         10297         15142         11690      11951.72     1437.2562
W  50         10296         15144         11692      11952.04     1437.5479

Patched:
Writes:  Total: 27624  Max/Min: 442/334   Fail: 0
Reads :  Total: 13957  Max/Min: 197/192   Fail: 0
    N           Min           Max        Median           Avg        Stddev
R  30         13725         17133         13902     14458.467      1199.125
W  30         27525         28880         27683     27844.533     372.82701


I don't know which of these locktorture setups is the more useful test
in general; the patch that removed the mdelay ("locktorture: Add
long_hold to adjust lock-hold delays") left some of the shortdelay-type
mdelay/udelay calls in place for other locks.

Of course, whenever proxy execution lands, the downsides of the current
nearly-100%-FIFO behavior will be gone.


---
 kernel/locking/percpu-rwsem.c | 55 +++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 185bd1c906b01..251c1a7a94e40 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -112,22 +112,22 @@ static bool __percpu_rwsem_trylock(struct percpu_rw_semaphore *sem, bool reader)
  *
  * We use EXCLUSIVE for both readers and writers to preserve FIFO order,
  * and play games with the return value to allow waking multiple readers.
  *
  * Specifically, we wake readers until we've woken a single writer, or until a
- * trylock fails.
+ * check of the sem->block value fails.
  */
 static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 				      unsigned int mode, int wake_flags,
 				      void *key)
 {
 	bool reader = wq_entry->flags & WQ_FLAG_CUSTOM;
 	struct percpu_rw_semaphore *sem = key;
 	struct task_struct *p;
 
-	/* concurrent against percpu_down_write(), can get stolen */
-	if (!__percpu_rwsem_trylock(sem, reader))
+	/* racy, just an optimization to stop waking if the lock is taken */
+	if (atomic_read(&sem->block))
 		return 1;
 
 	p = get_task_struct(wq_entry->private);
 	list_del_init(&wq_entry->entry);
 	smp_store_release(&wq_entry->private, NULL);
@@ -138,32 +138,44 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 	return !reader; /* wake (readers until) 1 writer */
 }
 
 static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 {
-	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 	bool wait;
+	bool first = true;
 
-	spin_lock_irq(&sem->waiters.lock);
-	/*
-	 * Serialize against the wakeup in percpu_up_write(), if we fail
-	 * the trylock, the wakeup must see us on the list.
-	 */
-	wait = !__percpu_rwsem_trylock(sem, reader);
-	if (wait) {
-		wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
-		__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
-	}
-	spin_unlock_irq(&sem->waiters.lock);
+	do {
+		DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 
-	while (wait) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!smp_load_acquire(&wq_entry.private))
-			break;
-		schedule();
-	}
-	__set_current_state(TASK_RUNNING);
+		spin_lock_irq(&sem->waiters.lock);
+		/*
+		 * Serialize against the wakeup in percpu_up_write(), if we fail
+		 * the trylock, the wakeup must see us on the list.
+		 */
+		wait = !__percpu_rwsem_trylock(sem, reader);
+		if (wait) {
+			wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM;
+			/*
+			 * If we wake but lose a race for the lock, preserve
+			 * FIFO-ish behavior by skipping the queue
+			 */
+			if (first)
+				__add_wait_queue_entry_tail(&sem->waiters, &wq_entry);
+			else
+				__add_wait_queue(&sem->waiters, &wq_entry);
+			first = false;
+		}
+		spin_unlock_irq(&sem->waiters.lock);
+
+		while (wait) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!smp_load_acquire(&wq_entry.private))
+				break;
+			schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+	} while (wait && !__percpu_rwsem_trylock(sem, reader));
 }
 
 bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 {
 	if (__percpu_down_read_trylock(sem))
-- 
2.43.0.429.g432eaa2c6b-goog

