linux-kernel - Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170220045839.GJ9178@tardis.cn.ibm.com>
Date:   Mon, 20 Feb 2017 12:58:39 +0800
From:   Boqun Feng <boqun.feng@...il.com>
To:     Andrea Parri <parri.andrea@...il.com>
Cc:     Waiman Long <longman@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Pan Xinhui <xinhui@...ux.vnet.ibm.com>,
        linux-kernel@...r.kernel.org, Will Deacon <will.deacon@....com>
Subject: Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve
 performance on some archs

(Really add Will this time ...)

On Mon, Feb 20, 2017 at 12:53:58PM +0800, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> > On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > > All the locking related cmpxchg's in the following functions are
> > > replaced with the _acquire variants:
> > >  - pv_queued_spin_steal_lock()
> > >  - trylock_clear_pending()
> > > 
> > > This change should help performance on architectures that use LL/SC.
> > > 
> > > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > > enabled, the performance of a locking microbenchmark with and without
> > > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > > were as follows:
> > > 
> > >   # of thread     w/o patch    with patch      % Change
> > >   -----------     ---------    ----------      --------
> > >        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
> > >        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
> > >       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
> > > 
> > > Signed-off-by: Waiman Long <longman@...hat.com>
> > > ---
> > > 
> > >  v2->v3:
> > >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > > 
> > >  v1->v2:
> > >   - Add comments in changelog and code for the rationale of the change.
> > > 
> > >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > > index e6b2f7a..a59dc05 100644
> > > --- a/kernel/locking/qspinlock_paravirt.h
> > > +++ b/kernel/locking/qspinlock_paravirt.h
> > > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> > >  	struct __qspinlock *l = (void *)lock;
> > >  
> > >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > >  		qstat_inc(qstat_pv_lock_stealing, true);
> > >  		return true;
> > >  	}
> > > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> > >  
> > >  /*
> > >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > > - * just to be sure that it will get it.
> > > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > > + * lock just to be sure that it will get it.
> > >   */
> > >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  {
> > >  	struct __qspinlock *l = (void *)lock;
> > >  
> > >  	return !READ_ONCE(l->locked) &&
> > > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > > -			== _Q_PENDING_VAL);
> > > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > > +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> > >  }
> > >  #else /* _Q_PENDING_BITS == 8 */
> > >  static __always_inline void set_pending(struct qspinlock *lock)
> > > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  		 */
> > >  		old = val;
> > >  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > > -		val = atomic_cmpxchg(&lock->val, old, new);
> > > +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
> > >  
> > >  		if (val == old)
> > >  			return 1;
> > > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > >  	 * observe its next->locked value and advance itself.
> > >  	 *
> > >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > +	 *
> > > +	 * We can't used relaxed form of cmpxchg here as the loading of
> > > +	 * pn->state can happen before setting next->locked in some archs.
> > >  	 */
> > >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > 
> > Hi Waiman.
> > 
> > cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
> > e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() get "compiled"
> > to something like:
> > 
> >     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> > 
> 
> Yes, sorry for be late for this one.
> 
> So Waiman, the fact is that in this case, we want the following code
> sequence:
> 
> 	CPU 0					CPU 1
> 	=================			====================
> 	{pn->state = vcpu_running, node->locked = 0}
> 
> 	smp_store_smb(&pn->state, vcpu_halted):
> 	  WRITE_ONCE(pn->state, vcpu_halted);
> 	  smp_mb();
> 	r1 = READ_ONCE(node->locked);
> 						arch_mcs_spin_unlock_contented();
> 						  WRITE_ONCE(node->locked, 1)
> 
> 						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
> 
> never ends up in:
> 
> 	r1 == 0 && cmpxchg fail(i.e. the read part of cmpxchg reads the
> 	value vcpu_running).
> 
> We can have such a guarantee if cmpxchg has a smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> in Cc for his insight ;-)).
> 
> So a possible "fix"(in case ARM64 will use qspinlock some day), would be
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
> 
> Regards,
> Boqun
> 
> >   Andrea
> > 
> > 
> > >  		return;
> > > -- 
> > > 1.8.3.1
> > > 



Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)