Message-Id: <1429898069-28907-1-git-send-email-Waiman.Long@hp.com>
Date:	Fri, 24 Apr 2015 13:54:29 -0400
From:	Waiman Long <Waiman.Long@...com>
To:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>
Cc:	linux-kernel@...r.kernel.org, Jason Low <jason.low2@...com>,
	Davidlohr Bueso <dave@...olabs.net>,
	Scott J Norton <scott.norton@...com>,
	Douglas Hatch <doug.hatch@...com>,
	Waiman Long <Waiman.Long@...com>
Subject: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do the
real wakeup.  For a heavily contended rwsem, doing a spin_lock() on
wait_lock adds further contention on the already hot rwsem cacheline
and delays completion of the up_read()/up_write() operations.
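
For context, the pre-patch unlock path looks roughly like the sketch
below (a simplified rendering of the behavior described above, not a
verbatim copy of the 4.0 source):

  /*
   * Simplified sketch: every up_read()/up_write() that notices
   * waiters takes the wait_lock unconditionally, so all unlockers
   * serialize on the same hot rwsem cacheline.
   */
  struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
  {
          unsigned long flags;

          raw_spin_lock_irqsave(&sem->wait_lock, flags);

          /* do nothing if list empty */
          if (!list_empty(&sem->wait_list))
                  sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);

          raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

          return sem;
  }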

This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional when at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later,
when it calls up_write(). With a spinning writer present, rwsem_wake()
now tries to take the wait_lock with a trylock and simply quits if
that fails.

This patch also adds one more check in __rwsem_do_wake(), just before
the expensive wakeup operation, to see whether the rwsem was stolen;
the wakeup is unnecessary if the lock was stolen.
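
In rough pseudo-code (the full change is in the diff below), the
unlock side becomes:

  if (rwsem_has_spinner(sem)) {
          /* order the osq read before the wait_lock access; see the diff */
          smp_mb__after_atomic();
          if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
                  return sem;     /* the spinner will do the wakeup via its own up_write() */
  } else {
          raw_spin_lock_irqsave(&sem->wait_lock, flags);
  }

and __rwsem_do_wake() now bails out early with

  if (rwsem_has_active_writer(sem))
          goto out;               /* lock was stolen, skip the wakeup */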

On an 8-socket Westmere-EX server (80 cores, HT off), running AIM7's
high_systime workload (1000 users) on a vanilla 4.0 kernel produced
the following perf profile for spinlock contention:

  9.23%    reaim  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
              |--97.39%-- rwsem_wake
              |--0.69%-- try_to_wake_up
              |--0.52%-- release_pages
               --1.40%-- [...]

  1.70%    reaim  [kernel.kallsyms]    [k] _raw_spin_lock_irq
              |--96.61%-- rwsem_down_write_failed
              |--2.03%-- __schedule
              |--0.50%-- run_timer_softirq
               --0.86%-- [...]

Here the contended rwsems are the mmap_sem (mm_struct) and the
i_mmap_rwsem (address_space) with mostly write locking.  With a
patched 4.0 kernel, the perf profile became:

  1.87%    reaim  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
              |--87.64%-- rwsem_wake
              |--2.80%-- release_pages
              |--2.56%-- try_to_wake_up
              |--1.10%-- __wake_up
              |--1.06%-- pagevec_lru_move_fn
              |--0.93%-- prepare_to_wait_exclusive
              |--0.71%-- free_pid
              |--0.58%-- get_page_from_freelist
              |--0.57%-- add_device_randomness
               --2.04%-- [...]

  0.80%    reaim  [kernel.kallsyms]    [k] _raw_spin_lock_irq
              |--92.49%-- rwsem_down_write_failed
              |--4.24%-- __schedule
              |--1.37%-- run_timer_softirq
               --1.91%-- [...]

The table below shows the % improvement in throughput (1100-2000 users)
in the various AIM7's workloads:

	Workload	% increase in throughput
	--------	------------------------
	custom			 3.8%
	five-sec		 3.5%
	fserver			 4.1%
	high_systime		22.2%
	shared			 2.1%
	short			10.1%

Signed-off-by: Waiman Long <Waiman.Long@...com>
---
 include/linux/osq_lock.h    |    5 +++
 kernel/locking/rwsem-xadd.c |   72 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 1 deletions(-)

 v2->v3:
   - Fix errors in commit log.

 v1->v2:
   - Add a memory barrier before calling spin_trylock for proper memory
     ordering.

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 3a6490e..703ea5c 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+
 #endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2f7cc40..f52c69a 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -107,6 +107,35 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * return true if there is an active writer by checking the owner field which
+ * should be set if there is one.
+ */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner) != NULL;
+}
+
+/*
+ * Return true if the rwsem has active spinner
+ */
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+	return osq_is_locked(&sem->osq);
+}
+#else /* CONFIG_RWSEM_SPIN_ON_OWNER */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+	return false;	/* Assume it has no active writer */
+}
+
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+	return false;
+}
+#endif /* CONFIG_RWSEM_SPIN_ON_OWNER */
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -125,6 +154,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
 	struct list_head *next;
 	long oldcount, woken, loop, adjustment;
 
+	/*
+	 * up_write() cleared the owner field before calling this function.
+	 * If that field is now set, a writer must have stolen the lock and
+	 * the wakeup operation should be aborted.
+	 */
+	if (rwsem_has_active_writer(sem))
+		goto out;
+
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
 		if (wake_type == RWSEM_WAKE_ANY)
@@ -478,7 +515,40 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+	/*
+	 * If a spinner is present, it is not necessary to do the wakeup.
+	 * Try to do wakeup only if the trylock succeeds to minimize
+	 * spinlock contention which may introduce too much delay in the
+	 * unlock operation.
+	 *
+	 *    spinning writer		up_write/up_read caller
+	 *    ---------------		-----------------------
+	 * [S]   osq_unlock()		[L]   osq
+	 *	 MB			      MB
+	 * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
+	 *
+	 * Here, it is important to make sure that there won't be a missed
+	 * wakeup while the rwsem is free and the only spinning writer goes
+	 * to sleep without taking the rwsem. In case the spinning writer is
+	 * just going to break out of the waiting loop, it will still do a
+	 * trylock in rwsem_down_write_failed() before sleeping. IOW, if
+	 * rwsem_has_spinner() is true, it will  guarantee at least one
+	 * trylock attempt on the rwsem.
+	 */
+	if (!rwsem_has_spinner(sem)) {
+		raw_spin_lock_irqsave(&sem->wait_lock, flags);
+	} else {
+		/*
+		 * rwsem_has_spinner() is an atomic read while spin_trylock
+		 * does not guarantee a full memory barrier. Insert a memory
+		 * barrier here to make sure that wait_lock isn't read until
+		 * after osq.
+		 * Note: smp_rmb__after_atomic() should be used if available.
+		 */
+		smp_mb__after_atomic();
+		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
+			return sem;
+	}
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-- 
1.7.1

