[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131108235108.GA22696@linux.vnet.ibm.com>
Date: Fri, 8 Nov 2013 15:51:08 -0800
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Waiman Long <waiman.long@...com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Arnd Bergmann <arnd@...db.de>,
linux-arch@...r.kernel.org, x86@...nel.org,
linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Michel Lespinasse <walken@...gle.com>,
Andi Kleen <andi@...stfloor.org>,
Rik van Riel <riel@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
George Spelvin <linux@...izon.com>,
Tim Chen <tim.c.chen@...ux.intel.com>, aswin@...com,
Scott J Norton <scott.norton@...com>
Subject: Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
On Fri, Nov 08, 2013 at 05:36:12PM -0500, Waiman Long wrote:
> On 11/08/2013 04:11 PM, Paul E. McKenney wrote:
> >On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
> >>Kernel JPM %Change from (1)
> >>------ --- ----------------
> >> 1 148265 -
> >> 2 238715 +61%
> >> 3 242048 +63%
> >> 4 234881 +58%
> >>
> >>The use of unfair qrwlock provides a small boost of 2%, while using
> >>fair qrwlock leads to 3% decrease of performance. However, looking
> >>at the perf profiles, we can clearly see that other bottlenecks were
> >>constraining the performance improvement.
> >>
> >>Perf profile of kernel (2):
> >>
> >> 18.20% reaim [kernel.kallsyms] [k] __write_lock_failed
> >> 9.36% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 2.91% reaim [kernel.kallsyms] [k] mspin_lock
> >> 2.73% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
> >> 2.23% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 1.29% reaim [kernel.kallsyms] [k] __read_lock_failed
> >> 1.21% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 1.14% reaim [kernel.kallsyms] [k] zap_pte_range
> >> 1.13% reaim [kernel.kallsyms] [k] _raw_spin_lock
> >> 1.04% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
> >>
> >>Perf profile of kernel (3):
> >>
> >> 10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 7.98% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
> >> 5.83% reaim [kernel.kallsyms] [k] mspin_lock
> >> 2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 2.71% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
> >> 1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 1.51% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
> >> 1.35% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
> >> 1.12% reaim [kernel.kallsyms] [k] zap_pte_range
> >> 1.06% reaim [kernel.kallsyms] [k] perf_event_aux_ctx
> >> 1.01% reaim [kernel.kallsyms] [k] perf_event_aux
> >But wouldn't kernel (4) be the one that was the most highly constrained?
> >
> >(That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
> >is unrelated to the qrwlock.)
>
> I think the performance data is a bit off as it was collected with a
> previous version that has a minor bug in it. I will rerun the test
> to get the new data.
>
> >
> >+/**
> >+ * queue_write_can_lock- would write_trylock() succeed?
> >+ * @lock: Pointer to queue rwlock structure
> >+ */
> >+static inline int queue_write_can_lock(struct qrwlock *lock)
> >+{
> >+ union qrwcnts rwcnts;
> >+
> >+ rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+ return !rwcnts.writer&& !rwcnts.readers;
> >+}
> >+
> >+/**
> >+ * queue_read_trylock - try to acquire read lock of a queue rwlock
> >+ * @lock : Pointer to queue rwlock structure
> >+ * Return: 1 if lock acquired, 0 if failed
> >+ */
> >+static inline int queue_read_trylock(struct qrwlock *lock)
> >+{
> >+ union qrwcnts cnts;
> >+ u8 wmask;
> >+
> >+ cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+ wmask = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> >+ if (likely(!(cnts.writer& wmask))) {
> >+ cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> >On an unfair lock, this can momentarily make queue_read_can_lock() give
> >a false positive. Not sure that this is a problem -- after all, the
> >return value from queue_read_can_lock() is immediately obsolete anyway.
>
> Yes, this is an issue. However, I don't this is a big deal as you
> said. Using cmpxchg may avoid this issue, but then their will be a
> fair chance of false collision among readers. So it is probably
> something we may have to live with.
>
> >>+/**
> >>+ * queue_write_unlock - release write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+static inline void queue_write_unlock(struct qrwlock *lock)
> >>+{
> >>+ /*
> >>+ * Make sure that none of the critical section will be leaked out.
> >>+ */
> >>+ smp_mb__before_clear_bit();
> >>+ ACCESS_ONCE(lock->cnts.writer) = 0;
> >>+ smp_mb__after_clear_bit();
> >How about the new smp_store_release() for this write? Looks to me that
> >smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
> >if they in fact do work for all architectures.
>
> Yes, I am thinking about updating the patch to use
> smp_store_release() once it is in. I don't want to create such a
> dependency for this patch yet. Anyway, clearing a bit looks the same
> as clearing a byte to me.
>
> >>+/*
> >>+ * Compared with regular rwlock, the queue rwlock has has the following
> >>+ * advantages:
> >>+ * 1. It is more deterministic for the fair variant. Even though there is
> >>+ * a slight chance of stealing the lock if come at the right moment, the
> >>+ * granting of the lock is mostly in FIFO order. Even the default unfair
> >>+ * variant is fairer at least among the writers.
> >>+ * 2. It is faster in high contention situation.
> >Sometimes, anyway! (Referring to your performance results on top of
> >Ingo's patch.)
>
> Will adjust the wording by adding "usually".
>
> >
> >+/**
> >+ * wait_in_queue - Add to queue and wait until it is at the head
> >+ * @lock: Pointer to queue rwlock structure
> >+ * @node: Node pointer to be added to the queue
> >+ *
> >+ * The use of smp_wmb() is to make sure that the other CPUs see the change
> >+ * ASAP.
> >+ */
> >+static __always_inline void
> >+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> >+{
> >+ struct qrwnode *prev;
> >+
> >+ node->next = NULL;
> >+ node->wait = true;
> >+ prev = xchg(&lock->waitq, node);
> >+ if (prev) {
> >+ prev->next = node;
> >+ smp_wmb();
> >This smp_wmb() desperately needs a comment. Presumably it is ordering
> >the above "prev->next = node" with some later write, but what write?
> >
> >Oh... I see the header comment above.
> >
> >Actually, memory barriers don't necessarily make things visible sooner.
> >They are instead used for ordering. Or did you actually measure a
> >performance increase with this? (Seems -highly- unlikely given smp_wmb()'s
> >definition on x86...)
>
> I have some incorrect assumptions about memory barrier. Anyway, this
> issue will be gone once I use the MCS lock/unlock code.
Here is a presentation that has some diagrams that might help:
http://www.rdrop.com/users/paulmck/scalability/paper/Scaling.2013.10.25c.pdf
So, for example, if X and Y are both initially zero:
CPU 0 CPU 1
ACCESS_ONCE(X) = 1; r1 = ACCESS_ONCE(Y);
smp_wmb(); smp_rmb();
ACCESS_ONCE(Y) = 1; r2 = ACCESS_ONCE(X);
Then the two memory barriers enforce a conditional ordering. The
condition is whether or not CPU 0's store to Y is seen by CPU 1's
load from Y. If it is, then the pair of memory barriers ensure that
CPU 1's load from X sees the result of CPU 0's store to X. In other
words, BUG_ON(r1 == 1 && r2 == 0) will never fire.
In general, if a memory access after memory barrier A happens before
a memory access before memory barrier B, then the two memory barriers
will ensure that applicable accesses before memory barrier A happen
before applicable accesses after memory barrier B.
Does that help?
Thanx, Paul
> >>+ /*
> >>+ * Wait until the waiting flag is off
> >>+ */
> >>+ while (ACCESS_ONCE(node->wait))
> >>+ cpu_relax();
> >>+ }
> >>+}
> >>+
> >>+/**
> >>+ * signal_next - Signal the next one in queue to be at the head
> >>+ * @lock: Pointer to queue rwlock structure
> >>+ * @node: Node pointer to the current head of queue
> >>+ */
> >>+static __always_inline void
> >>+signal_next(struct qrwlock *lock, struct qrwnode *node)
> >>+{
> >>+ struct qrwnode *next;
> >>+
> >>+ /*
> >>+ * Try to notify the next node first without disturbing the cacheline
> >>+ * of the lock. If that fails, check to see if it is the last node
> >>+ * and so should clear the wait queue.
> >>+ */
> >>+ next = ACCESS_ONCE(node->next);
> >>+ if (likely(next))
> >>+ goto notify_next;
> >>+
> >>+ /*
> >>+ * Clear the wait queue if it is the last node
> >>+ */
> >>+ if ((ACCESS_ONCE(lock->waitq) == node)&&
> >>+ (cmpxchg(&lock->waitq, node, NULL) == node))
> >>+ return;
> >>+ /*
> >>+ * Wait until the next one in queue set up the next field
> >>+ */
> >>+ while (likely(!(next = ACCESS_ONCE(node->next))))
> >>+ cpu_relax();
> >>+ /*
> >>+ * The next one in queue is now at the head
> >>+ */
> >>+notify_next:
> >>+ barrier();
> >>+ ACCESS_ONCE(next->wait) = false;
> >>+ smp_wmb();
> >Because smp_wmb() does not order reads, reads from the critical section
> >could leak out of the critical section. A full memory barrier (smp_mb())
> >seems necessary to avoid this.
> >
> >Yes, you do have full memory barriers implicit in various atomic operations,
> >but it appears to be possible to avoid them all in some situations.
>
> Yes, you are right.
>
> >>+/**
> >>+ * queue_write_3step_lock - acquire write lock in 3 steps
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ * Return: 1 if lock acquired, 0 otherwise
> >>+ *
> >>+ * Step 1 - Try to acquire the lock directly if no reader is present
> >>+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
> >>+ * Step 3 - When the readers field goes to 0, set the locked flag
> >>+ *
> >>+ * When not in fair mode, the readers actually ignore the second step.
> >>+ * However, this is still necessary to force other writers to fall in line.
> >>+ */
> >>+static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
> >>+{
> >>+ union qrwcnts old, new;
> >>+
> >>+ old.rw = ACCESS_ONCE(lock->cnts.rw);
> >>+
> >>+ /* Step 1 */
> >>+ if (!old.writer& !old.readers) {
> >>+ new.rw = old.rw;
> >>+ new.writer = QW_LOCKED;
> >>+ if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> >>+ return 1;
> >>+ }
> >>+
> >>+ /* Step 2 */
> >>+ if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
> >>+ return 0;
> >>+
> >>+ /* Step 3 */
> >>+ while (true) {
> >>+ cpu_relax();
> >>+ old.rw = ACCESS_ONCE(lock->cnts.rw);
> >Suppose that there now is a writer, but no readers...
> >
> >>+ if (!old.readers) {
> >>+ new.rw = old.rw;
> >>+ new.writer = QW_LOCKED;
> >>+ if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
> >>+ == old.rw))
> >... can't this mistakenly hand out the lock to a second writer?
> >
> >Ah, the trick is that we are at the head of the queue, so the only writer
> >we can possibly contend with is a prior holder of the lock. Once that
> >writer leaves, no other writer but can appear. And the QW_WAITING bit
> >prevents new writers from immediately grabbing the lock.
>
> Yes, that is the point. The current has 2 queue for writers - a one
> position queue in the waiting bit and the MCS locking queue that are
> used by both readers and writers.
>
> >>+ return 1;
> >>+ }
> >>+ }
> >>+ /* Should never reach here */
> >>+ return 0;
> >>+}
> >>+
> >>+/**
> >>+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+void queue_write_lock_slowpath(struct qrwlock *lock)
> >>+{
> >>+ struct qrwnode node;
> >>+
> >>+ /*
> >>+ * Put the writer into the wait queue
> >>+ */
> >>+ wait_in_queue(lock,&node);
> >>+
> >>+ /*
> >>+ * At the head of the wait queue now, call queue_write_3step_lock()
> >>+ * to acquire the lock until it is done.
> >>+ */
> >>+ while (!queue_write_3step_lock(lock))
> >>+ cpu_relax();
> >If we get here, queue_write_3step_lock() just executed a successful
> >cmpxchg(), which implies a full memory barrier. This prevents the
> >critical section from leaking out, good!
> >
> >>+ signal_next(lock,&node);
> >>+}
> >>+EXPORT_SYMBOL(queue_write_lock_slowpath);
> >>--
> >>1.7.1
> >>
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo@...r.kernel.org
> >More majordomo info at http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at http://www.tux.org/lkml/
>
> Thank for the review.
>
> -Longman
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists