Date:   Mon, 20 Feb 2017 12:53:58 +0800
From:   Boqun Feng <boqun.feng@...il.com>
To:     Andrea Parri <parri.andrea@...il.com>
Cc:     Waiman Long <longman@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Pan Xinhui <xinhui@...ux.vnet.ibm.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve
 performance on some archs

On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > All the locking related cmpxchg's in the following functions are
> > replaced with the _acquire variants:
> >  - pv_queued_spin_steal_lock()
> >  - trylock_clear_pending()
> > 
> > This change should help performance on architectures that use LL/SC.
> > 
> > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > enabled, the performance of a locking microbenchmark with and without
> > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > was as follows:
> > 
> >   # of threads    w/o patch    with patch      % Change
> >   -----------     ---------    ----------      --------
> >        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
> >        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
> >       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
> > 
> > Signed-off-by: Waiman Long <longman@...hat.com>
> > ---
> > 
> >  v2->v3:
> >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > 
> >  v1->v2:
> >   - Add comments in changelog and code for the rationale of the change.
> > 
> >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > index e6b2f7a..a59dc05 100644
> > --- a/kernel/locking/qspinlock_paravirt.h
> > +++ b/kernel/locking/qspinlock_paravirt.h
> > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> >  	struct __qspinlock *l = (void *)lock;
> >  
> >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> >  		qstat_inc(qstat_pv_lock_stealing, true);
> >  		return true;
> >  	}
> > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> >  
> >  /*
> >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > - * just to be sure that it will get it.
> > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > + * lock just to be sure that it will get it.
> >   */
> >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> >  {
> >  	struct __qspinlock *l = (void *)lock;
> >  
> >  	return !READ_ONCE(l->locked) &&
> > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > -			== _Q_PENDING_VAL);
> > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> >  }
> >  #else /* _Q_PENDING_BITS == 8 */
> >  static __always_inline void set_pending(struct qspinlock *lock)
> > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> >  		 */
> >  		old = val;
> >  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > -		val = atomic_cmpxchg(&lock->val, old, new);
> > +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
> >  
> >  		if (val == old)
> >  			return 1;
> > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> >  	 * observe its next->locked value and advance itself.
> >  	 *
> >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > +	 *
> > +	 * We can't use a relaxed form of cmpxchg here as the loading of
> > +	 * pn->state can happen before the setting of next->locked on some archs.
> >  	 */
> >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 
> Hi Waiman.
> 
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; cf.,
> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
> to something like:
> 
>     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> 

Yes, sorry for being late on this one.

So Waiman, the point is that in this case we want to guarantee that the
following code sequence:

	CPU 0					CPU 1
	=================			====================
	{pn->state = vcpu_running, node->locked = 0}

	smp_store_mb(&pn->state, vcpu_halted):
	  WRITE_ONCE(pn->state, vcpu_halted);
	  smp_mb();
	r1 = READ_ONCE(node->locked);
						arch_mcs_spin_unlock_contended();
						  WRITE_ONCE(node->locked, 1)

						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);

never ends up with:

	r1 == 0 && the cmpxchg failing (i.e., the read part of the cmpxchg
	reads the value vcpu_running).

We can have such a guarantee if cmpxchg() has an smp_mb() before its load
part, which is true on PPC. But semantically, cmpxchg() doesn't provide
any ordering guarantee if it fails, and that is indeed the case on ARM64:
in the sequence quoted above, the dmb ish sits on the success path only,
so a failed comparison branches straight to _exit with no barrier, IIUC.
(Adding Will in Cc for his insight ;-)).

So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
replace the cmpxchg() with smp_mb() + cmpxchg_relaxed().
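
For concreteness, a rough and untested sketch of what that replacement
could look like in the pv_kick_node() hunk quoted above (an illustration
of the idea only, not a posted patch):

	/*
	 * Hypothetical: make the store to next->locked, done earlier in
	 * arch_mcs_spin_unlock_contended(), ordered before the load part
	 * of the cmpxchg below, even when the cmpxchg fails.
	 */
	smp_mb();
	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
		return;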

Regards,
Boqun

>   Andrea
> 
> 
> >  		return;
> > -- 
> > 1.8.3.1
> > 
