Message-Id: <1483028026-10305-17-git-send-email-longman@redhat.com>
Date:   Thu, 29 Dec 2016 11:13:42 -0500
From:   Waiman Long <longman@...hat.com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Jonathan Corbet <corbet@....net>
Cc:     linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Scott J Norton <scott.norton@....com>,
        Waiman Long <longman@...hat.com>
Subject: [PATCH v4 16/20] TP-futex: Group readers together in wait queue

All the TP futex lock waiters are serialized in the kernel by a
kernel mutex that acts as a wait queue. The order in which the waiters
pop out of the wait queue affects performance when exclusive (writer)
and shared (reader) lock waiters are mixed in the queue. The worst-case
ordering is something like RWRWRW..., where no parallelism in lock
ownership can happen.

To improve throughput, the readers are now grouped together as a single
entity in the wait queue. The first reader that enters the mutex wait
queue becomes the leader of the group. The other readers spin on the
group leader via an OSQ lock. When the futex is put into shared mode,
either by the group leader or by an external reader, the spinning
readers in the group then acquire the read lock successively.

The spinning readers in the group are disbanded when the group leader
goes to sleep. In that case, each of those readers enters the mutex
wait queue on its own and waits for its turn to acquire the TP futex.
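
In outline, the group-leader election is a NULL-to-task cmpxchg on a
per-futex-state pointer (state->first_reader in the diff below); the
full logic is in futex_spin_on_reader(). A rough userspace analogue of
just the election step, using C11 atomics instead of the kernel
primitives (all names here are illustrative, not from the patch):

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Illustrative analogue of the NULL -> task cmpxchg done on
 * state->first_reader: only the first reader to arrive swings the
 * pointer and becomes the group leader; later readers spin on it.
 */
struct reader_group {
	_Atomic(void *) first_reader;	/* NULL when no reader group exists */
};

static bool become_leader(struct reader_group *grp, void *self)
{
	void *expected = NULL;

	/* Succeeds only for the first reader; later readers see a leader. */
	return atomic_compare_exchange_strong(&grp->first_reader,
					      &expected, self);
}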

On a 2-socket 36-core E5-2699 v3 system (HT off) running a 4.10-based
kernel, a microbenchmark (1 worker thread per CPU core, run for 10s)
compared the TP futex rwlock at a 1:1 reader/writer ratio against a
rwlock based on the wait-wake (WW) futexes and the Glibc rwlock:

                          WW futex       TP futex         Glibc
                          --------       --------         -----
Total locking ops        35,707,234     58,645,434     10,930,422
Per-thread avg/sec           99,149        162,887         30,362
Per-thread min/sec           93,190         38,641         29,872
Per-thread max/sec          104,213        225,983         30,708
Write lock futex calls   11,161,534         14,094          -
Write unlock futex calls  8,696,121            167          -
Read lock futex calls     1,717,863          6,659          -
Read unlock futex calls   4,316,418            323          -

It can be seen that in this particular case the throughput of the TP
futex is close to 2X that of the WW futex and almost 6X that of the
Glibc version.
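
The numbers come from a locking microbenchmark. The sketch below shows
the general shape of such a test (a hypothetical reconstruction, not
the actual test program; CPU pinning and per-thread min/max statistics
are omitted): each worker loops for 10s, takes the rwlock for read or
write according to the configured reader percentage, holds it for a
busy loop of L iterations, and counts its locking operations.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RUN_SECS	10
#define LOAD_LOOPS	1	/* the load latency "L" in the tables below */

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool stop;

struct worker {
	pthread_t	tid;
	int		reader_pct;	/* % of ops taken as read locks */
	unsigned long	ops;		/* per-thread locking op count */
};

static void *worker_fn(void *arg)
{
	struct worker *w = arg;
	unsigned int seed = (unsigned int)(uintptr_t)w;

	while (!atomic_load(&stop)) {
		int reader = (rand_r(&seed) % 100) < w->reader_pct;

		if (reader)
			pthread_rwlock_rdlock(&lock);
		else
			pthread_rwlock_wrlock(&lock);

		/* Simulated critical-section load of L busy iterations. */
		for (volatile int i = 0; i < LOAD_LOOPS; i++)
			;

		pthread_rwlock_unlock(&lock);
		w->ops++;
	}
	return NULL;
}

int main(void)
{
	int nthreads = (int)sysconf(_SC_NPROCESSORS_ONLN);
	struct worker *w = calloc(nthreads, sizeof(*w));
	unsigned long total = 0;

	for (int i = 0; i < nthreads; i++) {
		w[i].reader_pct = 50;		/* 1:1 reader/writer mix */
		pthread_create(&w[i].tid, NULL, worker_fn, &w[i]);
	}
	sleep(RUN_SECS);
	atomic_store(&stop, true);
	for (int i = 0; i < nthreads; i++) {
		pthread_join(w[i].tid, NULL);
		total += w[i].ops;
	}
	printf("Total locking ops: %lu\n", total);
	free(w);
	return 0;
}

The same harness would exercise the WW futex and TP futex variants by
swapping in their lock/unlock calls in place of the Glibc rwlock ones.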

The following table shows how the average per-thread locking rate
(ops/sec) scales with the number of CPU cores used:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
  9 threads (1 socket)    422,014        647,006       190,236
 18 threads (2 sockets)   197,145        330,353        66,934
 27 threads (2 sockets)   127,947        213,417        43,641
 36 threads (2 sockets)    99,149        162,887        30,362

The following table shows the average per-thread locking rates
(36 threads) with different reader percentages:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
   90% readers            124,426        159,657        60,208
   95% readers            148,905        152,315        68,666
  100% readers            210,029        191,759        84,316

The rwlocks based on the WW futexes and on Glibc prefer readers by
default, so their performance increased with higher reader percentages.
The TP futex rwlock, however, prefers writers somewhat more than
readers, so its performance did not increase as the reader percentage
rose.

The WW futex and Glibc rwlocks also have writer-preferring versions.
Their performance in the same tests is as follows (the TP futex column
is repeated from above for comparison):

                          WW futex       TP futex       Glibc
                          --------       --------       -----
   90% readers             87,866        159,657        26,057
   95% readers             93,193        152,315        32,611
  100% readers            193,267        191,759        88,440

With separate sets of 18 reader and 18 writer threads, the average
per-thread reader and writer locking rates with different load
latencies (L, default = 1) and reader-preferring rwlocks are:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
 Reader rate, L=1         381,411         74,059       164,184
 Writer rate, L=1               0        240,841             0

 Reader rate, L=5         330,732         57,361       150,691
 Writer rate, L=5               0        175,400             0

 Reader rate, L=50        304,505         17,355        97,066
 Writer rate, L=50              0        114,504             0

The corresponding locking rates with writer-preferring rwlocks are:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
 Reader rate, L=1         138,805         74,059            54
 Writer rate, L=1          31,113        240,841        56,424

 Reader rate, L=5         114,414         57,361            24
 Writer rate, L=5          28,062        175,400        52,039

 Reader rate, L=50         88,619         51,483             5
 Writer rate, L=50         21,005         98,005        49,885

With both the WW futex and Glibc rwlocks, the writers were starved
when the reader-preferring versions were used. Among the
writer-preferring versions, the WW futex one fared better; the Glibc
one, however, came close to starving the readers.

The TP futex prefers writers in general, but the actual preference
depends on the timing. Lock starvation should not happen on TP futexes
as long as the underlying kernel mutex is starvation-free, which is the
case for 4.10 and later kernels.
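
As an illustration of the extended return value, userspace could
decode the status word from a FUTEX_LOCK call as below. This is a
sketch only; the TP_* values are hand-mirrored from the bit layout in
kernel/futex.c in this patch and are not part of any exported uapi
header:

#include <stdio.h>

/* Hand-mirrored from kernel/futex.c in this patch; illustrative only. */
#define TP_LOCK_STOLEN		0
#define TP_LOCK_ACQUIRED	1
#define TP_LOCK_HANDOFF		2
#define TP_READER_ALONE		1
#define TP_READER_GROUP		2

static void decode_tp_status(int ret)
{
	if (ret < 0) {
		printf("futex(2) error: %d\n", ret);
		return;
	}

	int how    = ret & 0xff;	/* bits 0-7:  how the lock was taken */
	int reader = (ret >> 8) & 0x3;	/* bits 8-9:  reader spin state */
	int sleeps = ret >> 16;		/* bits 16-30: sleeps/yields in kernel */

	printf("lock: %s, reader: %s, sleeps: %d\n",
	       how == TP_LOCK_STOLEN  ? "stolen" :
	       how == TP_LOCK_HANDOFF ? "handed off" : "acquired as top waiter",
	       reader == TP_READER_GROUP ? "group leader" :
	       reader == TP_READER_ALONE ? "spun alone" : "n/a",
	       sleeps);
}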

Signed-off-by: Waiman Long <longman@...hat.com>
---
 kernel/futex.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 132 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index b690a79..5e69c44 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -239,6 +239,16 @@ struct futex_state {
 	u32 handoff_pid;			/* For TP futexes only */
 	int locksteal_disabled;			/* For TP futexes only */
 
+	/*
+	 * To improve reader throughput in TP futexes, all the readers
+	 * in the mutex queue are grouped together. The first reader in the
+	 * queue will set first_reader, then the rest of the readers will
+	 * spin on the first reader via the OSQ without actually entering
+	 * the mutex queue.
+	 */
+	struct optimistic_spin_queue reader_osq;
+	struct task_struct *first_reader;
+
 	enum futex_type type;
 	union futex_key key;
 };
@@ -3388,14 +3398,21 @@ void exit_robust_list(struct task_struct *curr)
  *		   0 - steals the lock
  *		   1 - top waiter (mutex owner) acquires the lock
  *		   2 - handed off the lock
- * 2) bits 08-15: reserved
- * 3) bits 15-30: how many times the task has slept or yield to scheduler
+ * 2) bit  08:	  1 if reader spins alone		(shared lock only)
+ *    bit  09:	  1 if reader is a spin group leader	(shared lock only)
+ *    bits 10-15: reserved
+ * 3) bits 16-30: how many times the task has slept or yielded
  *		  in futex_spin_on_owner().
  */
 #define TP_LOCK_STOLEN		0
 #define TP_LOCK_ACQUIRED	1
 #define TP_LOCK_HANDOFF		2
+
+#define TP_READER_ALONE		1
+#define TP_READER_GROUP		2
+
 #define TP_STATUS_SLEEP(val, sleep)	((val)|((sleep) << 16))
+#define TP_STATUS_ALONE(val, alone)	((val)|((alone) << 8))
 
 /**
  * lookup_futex_state - Looking up the futex state structure.
@@ -3438,6 +3455,7 @@ void exit_robust_list(struct task_struct *curr)
 	state->type = TYPE_TP;
 	state->key = *key;
 	state->locksteal_disabled = false;
+	osq_lock_init(&state->reader_osq);
 	list_add(&state->fs_list, &hb->fs_head);
 	WARN_ON(atomic_read(&state->refcount) != 1);
 
@@ -3895,6 +3913,98 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid,
 	return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep);
 }
 
+/**
+ * futex_spin_on_reader - Optimistically spin on first reader
+ * @uaddr:   futex address
+ * @pfirst:  pointer to first reader
+ * @state:   futex state object
+ * @timeout: hrtimer_sleeper structure
+ *
+ * Reader performance will depend on the placement of readers within the
+ * mutex queue. For a queue of 4 readers and 4 writers, for example, the
+ * optimal placement will be either RRRRWWWW or WWWWRRRR. The worst case will
+ * be RWRWRWRW.
+ *
+ * One way to avoid the worst case scenario is to gather all the readers
+ * together as a single unit and place it into the mutex queue. This is done
+ * by having the first reader put its task structure into state->first_reader
+ * and the rest of the readers optimistically spin on it instead of entering
+ * the mutex queue.
+ *
+ * On exit from this function, the reader would have
+ *  1) acquired a read lock on the futex;
+ *  2) become the first reader and entered the mutex queue; or
+ *  3) seen the first reader sleep; it then enters the mutex queue alone.
+ *
+ * Any fault on accessing the futex will cause this function to return 0
+ * and the reader to go into the mutex queue.
+ *
+ * Return:  > 0 - acquired the read lock
+ *	   <= 0 - need to go into the mutex queue
+ */
+static int futex_spin_on_reader(u32 __user *uaddr, struct task_struct **pfirst,
+				struct futex_state *state,
+				struct hrtimer_sleeper *timeout)
+{
+	struct task_struct *first_reader;
+	int ret = 0;
+
+	preempt_disable();
+
+retry:
+	first_reader = READ_ONCE(state->first_reader);
+	if (!first_reader)
+		first_reader = cmpxchg(&state->first_reader, NULL, current);
+	if (!first_reader)
+		goto out;	/* Became the first reader */
+
+	if (!osq_lock(&state->reader_osq))
+		goto reschedule;
+
+	for (;;) {
+		u32 uval;
+
+		if (!state->locksteal_disabled) {
+			ret = futex_trylock_preempt_disabled(uaddr,
+						FUTEX_SHARED, &uval);
+			/*
+			 * Return if lock acquired or an error happened
+			 */
+			if (ret)
+				break;
+		}
+
+		/*
+		 * Re-read the first reader value.
+		 */
+		first_reader = READ_ONCE(state->first_reader);
+		if (!first_reader)
+			first_reader = cmpxchg(&state->first_reader, NULL,
+						current);
+		if (!first_reader || !first_reader->on_cpu)
+			break;
+
+		if (need_resched()) {
+			osq_unlock(&state->reader_osq);
+			goto reschedule;
+		}
+
+		cpu_relax();
+	}
+	osq_unlock(&state->reader_osq);
+out:
+	*pfirst = first_reader;
+	preempt_enable();
+	return ret;
+
+reschedule:
+	/*
+	 * Yield the CPU and retry later.
+	 */
+	schedule_preempt_disabled();
+	goto retry;
+}
+
 /*
  * Userspace tried a 0 -> TID atomic transition of the futex value
  * and failed. The kernel side here does the whole locking operation.
@@ -3921,9 +4031,11 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct futex_hash_bucket *hb;
+	struct task_struct *first_reader = NULL;
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_state *state;
 	u32 uval, vpid = shared ? FUTEX_SHARED : task_pid_vnr(current);
+	int alone = 0;
 	int ret;
 
 	/*
@@ -3989,6 +4101,17 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	/*
+	 * Spin on the first reader and return if we acquired the read lock.
+	 */
+	if (shared) {
+		int rspin_ret = futex_spin_on_reader(uaddr, &first_reader,
+						     state, to);
+		if (rspin_ret > 0)
+			goto out_put_state_key;
+		alone = first_reader ? TP_READER_ALONE : TP_READER_GROUP;
+	}
+
+	/*
 	 * Acquiring the serialization mutex.
 	 *
 	 * If we got a signal or has some other error, we need to abort
@@ -4016,6 +4139,12 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 	mutex_unlock(&state->mutex);
 
 out_put_state_key:
+	/*
+	 * We will be the first reader if (first_reader == NULL).
+	 */
+	if (shared && !first_reader)
+		WRITE_ONCE(state->first_reader, NULL);
+
 	if (!put_futex_state_unlocked(state)) {
 		/*
 		 * May need to free the futex state object and so must be
@@ -4032,7 +4161,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags,
 		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
 	}
-	return ret;
+	return (ret < 0) ? ret : TP_STATUS_ALONE(ret, alone);
 }
 
 /*
-- 
1.8.3.1
