Date:	Fri, 8 Nov 2013 15:51:08 -0800
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Waiman Long <waiman.long@...com>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, Arnd Bergmann <arnd@...db.de>,
	linux-arch@...r.kernel.org, x86@...nel.org,
	linux-kernel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Michel Lespinasse <walken@...gle.com>,
	Andi Kleen <andi@...stfloor.org>,
	Rik van Riel <riel@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
	George Spelvin <linux@...izon.com>,
	Tim Chen <tim.c.chen@...ux.intel.com>, aswin@...com,
	Scott J Norton <scott.norton@...com>
Subject: Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation

On Fri, Nov 08, 2013 at 05:36:12PM -0500, Waiman Long wrote:
> On 11/08/2013 04:11 PM, Paul E. McKenney wrote:
> >On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
> >>Kernel	JPM	%Change from (1)
> >>------	---	----------------
> >>   1	148265		-
> >>   2	238715	       +61%
> >>   3	242048	       +63%
> >>   4	234881	       +58%
> >>
> >>The use of the unfair qrwlock provides a small boost of 2%, while using
> >>the fair qrwlock leads to a 3% decrease in performance. However, looking
> >>at the perf profiles, we can clearly see that other bottlenecks were
> >>constraining the performance improvement.
> >>
> >>Perf profile of kernel (2):
> >>
> >>   18.20%   reaim  [kernel.kallsyms]  [k] __write_lock_failed
> >>    9.36%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    2.91%   reaim  [kernel.kallsyms]  [k] mspin_lock
> >>    2.73%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
> >>    2.23%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    1.29%   reaim  [kernel.kallsyms]  [k] __read_lock_failed
> >>    1.21%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    1.14%   reaim  [kernel.kallsyms]  [k] zap_pte_range
> >>    1.13%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock
> >>    1.04%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
> >>
> >>Perf profile of kernel (3):
> >>
> >>   10.57%   reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    7.98%   reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
> >>    5.83%   reaim  [kernel.kallsyms]  [k] mspin_lock
> >>    2.86%      ls  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    2.71%   reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
> >>    1.52%    true  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >>    1.51%   reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
> >>    1.35%   reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
> >>    1.12%   reaim  [kernel.kallsyms]  [k] zap_pte_range
> >>    1.06%   reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
> >>    1.01%   reaim  [kernel.kallsyms]  [k] perf_event_aux
> >But wouldn't kernel (4) be the one that was the most highly constrained?
> >
> >(That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
> >is unrelated to the qrwlock.)
> 
> I think the performance data is a bit off as it was collected with a
> previous version that has a minor bug in it. I will rerun the test
> to get the new data.
> 
> >
> >+/**
> >+ * queue_write_can_lock- would write_trylock() succeed?
> >+ * @lock: Pointer to queue rwlock structure
> >+ */
> >+static inline int queue_write_can_lock(struct qrwlock *lock)
> >+{
> >+	union qrwcnts rwcnts;
> >+
> >+	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+	return !rwcnts.writer && !rwcnts.readers;
> >+}
> >+
> >+/**
> >+ * queue_read_trylock - try to acquire read lock of a queue rwlock
> >+ * @lock : Pointer to queue rwlock structure
> >+ * Return: 1 if lock acquired, 0 if failed
> >+ */
> >+static inline int queue_read_trylock(struct qrwlock *lock)
> >+{
> >+	union qrwcnts cnts;
> >+	u8 wmask;
> >+
> >+	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+	wmask   = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> >+	if (likely(!(cnts.writer & wmask))) {
> >+		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> >On an unfair lock, this can momentarily make queue_read_can_lock() give
> >a false positive.  Not sure that this is a problem -- after all, the
> >return value from queue_read_can_lock() is immediately obsolete anyway.
> 
> Yes, this is an issue. However, I don't think this is a big deal, as you
> said. Using cmpxchg may avoid this issue, but then there will be a
> fair chance of false collisions among readers. So it is probably
> something we may have to live with.
> 
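For concreteness, here is a rough sketch of the cmpxchg-based variant being
weighed above.  It is purely illustrative (it reuses the qrwcnts,
QW_MASK_FAIR/QW_MASK_UNFAIR and QRW_READER_BIAS names from the quoted patch,
but is not part of it): it never publishes a transient reader count, so
queue_read_can_lock() cannot see a false positive, at the cost of readers
spuriously failing when they race on the same word.

	/*
	 * Hypothetical cmpxchg-based read trylock, for comparison only.
	 * Unlike the xadd() version, it only updates lock->cnts.rw when
	 * the whole word is unchanged, so no transient reader bias is
	 * ever visible to queue_read_can_lock().
	 */
	static inline int queue_read_trylock_cmpxchg(struct qrwlock *lock)
	{
		union qrwcnts old, new;
		u8 wmask;

		old.rw = ACCESS_ONCE(lock->cnts.rw);
		wmask  = old.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
		if (old.writer & wmask)
			return 0;
		new.rw = old.rw + QRW_READER_BIAS;
		/* Fails if any reader or writer touched the word meanwhile */
		return cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw;
	}
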
> >>+/**
> >>+ * queue_write_unlock - release write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+static inline void queue_write_unlock(struct qrwlock *lock)
> >>+{
> >>+	/*
> >>+	 * Make sure that no access in the critical section leaks out.
> >>+	 */
> >>+	smp_mb__before_clear_bit();
> >>+	ACCESS_ONCE(lock->cnts.writer) = 0;
> >>+	smp_mb__after_clear_bit();
> >How about the new smp_store_release() for this write?  Looks to me that
> >smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
> >if they in fact do work for all architectures.
> 
> Yes, I am thinking about updating the patch to use
> smp_store_release() once it is in. I don't want to create such a
> dependency for this patch yet. Anyway, clearing a bit looks the same
> as clearing a byte to me.
> 
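As a reference point, here is a sketch of what queue_write_unlock() might look
like once smp_store_release() is available.  This assumes release ordering is
all the unlock path needs, which is what the discussion above suggests; it is
not the patch's code.

	static inline void queue_write_unlock(struct qrwlock *lock)
	{
		/*
		 * The release store keeps every access in the critical
		 * section ordered before the store that drops the lock.
		 */
		smp_store_release(&lock->cnts.writer, 0);
	}
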
> >>+/*
> >>+ * Compared with a regular rwlock, the queue rwlock has the following
> >>+ * advantages:
> >>+ * 1. It is more deterministic for the fair variant. Even though there is
> >>+ *    a slight chance of stealing the lock if one comes at the right moment,
> >>+ *    the granting of the lock is mostly in FIFO order. Even the default
> >>+ *    unfair variant is fairer, at least among the writers.
> >>+ * 2. It is faster in high contention situations.
> >Sometimes, anyway!  (Referring to your performance results on top of
> >Ingo's patch.)
> 
> Will adjust the wording by adding "usually".
> 
> >
> >+/**
> >+ * wait_in_queue - Add to queue and wait until it is at the head
> >+ * @lock: Pointer to queue rwlock structure
> >+ * @node: Node pointer to be added to the queue
> >+ *
> >+ * The use of smp_wmb() is to make sure that the other CPUs see the change
> >+ * ASAP.
> >+ */
> >+static __always_inline void
> >+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> >+{
> >+	struct qrwnode *prev;
> >+
> >+	node->next = NULL;
> >+	node->wait = true;
> >+	prev = xchg(&lock->waitq, node);
> >+	if (prev) {
> >+		prev->next = node;
> >+		smp_wmb();
> >This smp_wmb() desperately needs a comment.  Presumably it is ordering
> >the above "prev->next = node" with some later write, but what write?
> >
> >Oh...  I see the header comment above.
> >
> >Actually, memory barriers don't necessarily make things visible sooner.
> >They are instead used for ordering.  Or did you actually measure a
> >performance increase with this?  (Seems -highly- unlikely given smp_wmb()'s
> >definition on x86...)
> 
> I had some incorrect assumptions about memory barriers. Anyway, this
> issue will go away once I use the MCS lock/unlock code.

Here is a presentation that has some diagrams that might help:

http://www.rdrop.com/users/paulmck/scalability/paper/Scaling.2013.10.25c.pdf

So, for example, if X and Y are both initially zero:

	CPU 0			CPU 1

	ACCESS_ONCE(X) = 1;	r1 = ACCESS_ONCE(Y);
	smp_wmb();		smp_rmb();
	ACCESS_ONCE(Y) = 1;	r2 = ACCESS_ONCE(X);

Then the two memory barriers enforce a conditional ordering.  The
condition is whether or not CPU 0's store to Y is seen by CPU 1's
load from Y.  If it is, then the pair of memory barriers ensure that
CPU 1's load from X sees the result of CPU 0's store to X.  In other
words, BUG_ON(r1 == 1 && r2 == 0) will never fire.

In general, if a memory access after memory barrier A happens before
a memory access before memory barrier B, then the two memory barriers
will ensure that applicable accesses before memory barrier A happen
before applicable accesses after memory barrier B.
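
To make the pairing concrete outside the kernel, here is a small userspace
sketch of the same litmus test using C11 atomics.  This is an editorial
analogue, not kernel code; the release/acquire fences stand in for
smp_wmb()/smp_rmb(), which they are at least as strong as for this pattern.

	#include <assert.h>
	#include <pthread.h>
	#include <stdatomic.h>

	static atomic_int X, Y;
	static int r1, r2;

	static void *cpu0(void *arg)
	{
		atomic_store_explicit(&X, 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);	/* ~ smp_wmb() */
		atomic_store_explicit(&Y, 1, memory_order_relaxed);
		return NULL;
	}

	static void *cpu1(void *arg)
	{
		r1 = atomic_load_explicit(&Y, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* ~ smp_rmb() */
		r2 = atomic_load_explicit(&X, memory_order_relaxed);
		return NULL;
	}

	int main(void)
	{
		pthread_t t0, t1;

		pthread_create(&t0, NULL, cpu0, NULL);
		pthread_create(&t1, NULL, cpu1, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		/* The fence pairing forbids r1 == 1 && r2 == 0. */
		assert(!(r1 == 1 && r2 == 0));
		return 0;
	}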

Does that help?

							Thanx, Paul

> >>+		/*
> >>+		 * Wait until the waiting flag is off
> >>+		 */
> >>+		while (ACCESS_ONCE(node->wait))
> >>+			cpu_relax();
> >>+	}
> >>+}
> >>+
> >>+/**
> >>+ * signal_next - Signal the next one in queue to be at the head
> >>+ * @lock: Pointer to queue rwlock structure
> >>+ * @node: Node pointer to the current head of queue
> >>+ */
> >>+static __always_inline void
> >>+signal_next(struct qrwlock *lock, struct qrwnode *node)
> >>+{
> >>+	struct qrwnode *next;
> >>+
> >>+	/*
> >>+	 * Try to notify the next node first without disturbing the cacheline
> >>+	 * of the lock. If that fails, check to see if it is the last node
> >>+	 * and so should clear the wait queue.
> >>+	 */
> >>+	next = ACCESS_ONCE(node->next);
> >>+	if (likely(next))
> >>+		goto notify_next;
> >>+
> >>+	/*
> >>+	 * Clear the wait queue if it is the last node
> >>+	 */
> >>+	if ((ACCESS_ONCE(lock->waitq) == node) &&
> >>+	    (cmpxchg(&lock->waitq, node, NULL) == node))
> >>+			return;
> >>+	/*
> >>+	 * Wait until the next one in queue set up the next field
> >>+	 */
> >>+	while (likely(!(next = ACCESS_ONCE(node->next))))
> >>+		cpu_relax();
> >>+	/*
> >>+	 * The next one in queue is now at the head
> >>+	 */
> >>+notify_next:
> >>+	barrier();
> >>+	ACCESS_ONCE(next->wait) = false;
> >>+	smp_wmb();
> >Because smp_wmb() does not order reads, reads from the critical section
> >could leak out of the critical section.  A full memory barrier (smp_mb())
> >seems necessary to avoid this.
> >
> >Yes, you do have full memory barriers implicit in various atomic operations,
> >but it appears to be possible to avoid them all in some situations.
> 
> Yes, you are right.
> 
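One possible shape of that fix, sketched here for reference (editorial, not
the author's final code): a slightly restructured signal_next() that promotes
the smp_wmb() at the handoff to a full smp_mb(), so reads in the critical
section are also ordered before the next waiter is released.

	static __always_inline void
	signal_next(struct qrwlock *lock, struct qrwnode *node)
	{
		struct qrwnode *next;

		next = ACCESS_ONCE(node->next);
		if (!next) {
			/* Last node: clear the wait queue and return */
			if ((ACCESS_ONCE(lock->waitq) == node) &&
			    (cmpxchg(&lock->waitq, node, NULL) == node))
				return;
			/* Otherwise wait for the successor to link itself in */
			while (!(next = ACCESS_ONCE(node->next)))
				cpu_relax();
		}
		smp_mb();	/* order the critical section before the handoff */
		ACCESS_ONCE(next->wait) = false;
	}
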
> >>+/**
> >>+ * queue_write_3step_lock - acquire write lock in 3 steps
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ * Return: 1 if lock acquired, 0 otherwise
> >>+ *
> >>+ * Step 1 - Try to acquire the lock directly if no reader is present
> >>+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
> >>+ * Step 3 - When the readers field goes to 0, set the locked flag
> >>+ *
> >>+ * When not in fair mode, the readers actually ignore the second step.
> >>+ * However, this is still necessary to force other writers to fall in line.
> >>+ */
> >>+static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
> >>+{
> >>+	union qrwcnts old, new;
> >>+
> >>+	old.rw = ACCESS_ONCE(lock->cnts.rw);
> >>+
> >>+	/* Step 1 */
> >>+	if (!old.writer && !old.readers) {
> >>+		new.rw     = old.rw;
> >>+		new.writer = QW_LOCKED;
> >>+		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> >>+			return 1;
> >>+	}
> >>+
> >>+	/* Step 2 */
> >>+	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
> >>+		return 0;
> >>+
> >>+	/* Step 3 */
> >>+	while (true) {
> >>+		cpu_relax();
> >>+		old.rw = ACCESS_ONCE(lock->cnts.rw);
> >Suppose that there now is a writer, but no readers...
> >
> >>+		if (!old.readers) {
> >>+			new.rw     = old.rw;
> >>+			new.writer = QW_LOCKED;
> >>+			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
> >>+				== old.rw))
> >... can't this mistakenly hand out the lock to a second writer?
> >
> >Ah, the trick is that we are at the head of the queue, so the only writer
> >we can possibly contend with is a prior holder of the lock.  Once that
> >writer leaves, no other writer can appear.  And the QW_WAITING bit
> >prevents new writers from immediately grabbing the lock.
> 
> Yes, that is the point. The current code has 2 queues for writers - a
> one-position queue in the waiting bit and the MCS locking queue that is
> used by both readers and writers.
> 
> >>+				return 1;
> >>+		}
> >>+	}
> >>+	/* Should never reach here */
> >>+	return 0;
> >>+}
> >>+
> >>+/**
> >>+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+void queue_write_lock_slowpath(struct qrwlock *lock)
> >>+{
> >>+	struct qrwnode node;
> >>+
> >>+	/*
> >>+	 * Put the writer into the wait queue
> >>+	 */
> >>+	wait_in_queue(lock, &node);
> >>+
> >>+	/*
> >>+	 * At the head of the wait queue now, call queue_write_3step_lock()
> >>+	 * to acquire the lock until it is done.
> >>+	 */
> >>+	while (!queue_write_3step_lock(lock))
> >>+		cpu_relax();
> >If we get here, queue_write_3step_lock() just executed a successful
> >cmpxchg(), which implies a full memory barrier.  This prevents the
> >critical section from leaking out, good!
> >
> >>+	signal_next(lock, &node);
> >>+}
> >>+EXPORT_SYMBOL(queue_write_lock_slowpath);
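Stepping back to the two-queue point above, the writer's full path through the
quoted code is, in summary (editorial recap of the patch, not new code):

	/*
	 * wait_in_queue()              join the shared MCS-style queue
	 * queue_write_3step_lock():
	 *   step 1: cmpxchg the word   take the lock if it is completely free
	 *   step 2: cmpxchg the byte   claim the one-slot queue (QW_WAITING)
	 *   step 3: readers drain to 0 then cmpxchg QW_WAITING -> QW_LOCKED
	 * signal_next()                pass the queue head to the next waiter
	 */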
> >>-- 
> >>1.7.1
> >>
> 
> Thanks for the review.
> 
> -Longman
> 

