linux-kernel - Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170208060939.GC9178@tardis.cn.ibm.com>
Date:   Wed, 8 Feb 2017 14:09:39 +0800
From:   Boqun Feng <boqun.feng@...il.com>
To:     Xinhui Pan <mnipxh@...il.com>
Cc:     Waiman Long <longman@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
        Pan Xinhui <xinhui@...ux.vnet.ibm.com>
Subject: Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve
 performance on some archs

On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:
> On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
> > 2016-12-26 4:26 GMT+08:00 Waiman Long <longman@...hat.com>:
> > 
> > > A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
> > > relaxed versions to improve performance on architectures that use LL/SC.
> > >
> > > All the locking related cmpxchg's are replaced with the _acquire
> > > variants:
> > >  - pv_queued_spin_steal_lock()
> > >  - trylock_clear_pending()
> > >
> > > The cmpxchg's related to hashing are replaced by either by the _release
> > > or the _relaxed variants. See the inline comment for details.
> > >
> > > Signed-off-by: Waiman Long <longman@...hat.com>
> > >
> > >  v1->v2:
> > >   - Add comments in changelog and code for the rationale of the change.
> > >
> > > ---
> > >  kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++------
> > > -------
> > >  1 file changed, 33 insertions(+), 17 deletions(-)
> > >
> > >
> > > @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node,
> > > struct mcs_spinlock *prev)
> > >                  * If pv_kick_node() changed us to vcpu_hashed, retain that
> > >                  * value so that pv_wait_head_or_lock() knows to not also
> > > try
> > >                  * to hash this lock.
> > > +                *
> > > +                * The smp_store_mb() and control dependency above will
> > > ensure
> > > +                * that state change won't happen before that.
> > > Synchronizing
> > > +                * with pv_kick_node() wrt hashing by this waiter or by the
> > > +                * lock holder is done solely by the state variable. There
> > > is
> > > +                * no other ordering requirement.
> > >                  */
> > > -               cmpxchg(&pn->state, vcpu_halted, vcpu_running);
> > > +               cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
> > >
> > >                 /*
> > >                  * If the locked flag is still not set after wakeup, it is
> > > a
> > > @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock,
> > > struct mcs_spinlock *node)
> > >          * pv_wait_node(). If OTOH this fails, the vCPU was running and
> > > will
> > >          * observe its next->locked value and advance itself.
> > >          *
> > > -        * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > +        * Matches with smp_store_mb() and cmpxchg_relaxed() in
> > > pv_wait_node().
> > > +        * A release barrier is used here to ensure that node->locked is
> > > +        * always set before changing the state. See comment in
> > > pv_wait_node().
> > >          */
> > > -       if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > > +       if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
> > > +                       != vcpu_halted)
> > >                 return;
> > >
> > > hi, Waiman
> > We can't use _release here, a full barrier is needed.
> > 
> > There is pv_kick_node vs pv_wait_head_or_lock
> > 
> > [w] l->locked = _Q_SLOW_VAL  //reordered here
> > 
> > if (READ_ONCE(pn->state) == vcpu_hashed) //False.
> > 
> >                    lp = (struct qspinlock **)1;
> > 
> > [STORE] pn->state = vcpu_hashed                        lp = pv_hash(lock,
> > pn);
> > pv_hash()                                                                if
> > (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.
> > 
> 
> This analysis is correct, but..
> 

Hmm.. look at this again, I don't think this analysis is meaningful,
let's say the reordering didn't happen, we still got(similar to your
case):

						if (READ_ONCE(pn->state) == vcpu_hashed) // false.
						  lp = (struct qspinlock **)1;

cmpxchg(pn->state, vcpu_halted, vcpu_hashed);
						  if(!lp) {
						    lp = pv_hash(lock, pn);
WRITE_ONCE(l->locked, _Q_SLOW_VAL);
pv_hash();
						    if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.

, right?

Actually, I think this or your case could not happen because we have

	cmpxchg(pn->state, vcpu_halted, vcpu_running);

in pv_wait_node(), which makes us either observe vcpu_hashed or set
pn->state to vcpu_running before pv_kick_node() trying to do the hash.

I may miss something subtle, but does switching back to cmpxchg() could
fix the RCU stall you observed?

Regards,
Boqun

> > Then the same lock has hashed twice but only unhashed once. So at last as
> > the hash table grows big, we hit RCU stall.
> > 
> > I hit RCU stall when I run netperf benchmark
> > 
> 
> how will a big hash table hit RCU stall? Do you have the call trace for
> your RCU stall?
> 
> Regards,
> Boqun
> 
> > thanks
> > xinhui
> > 
> > 
> > > --
> > > 1.8.3.1
> > >
> > >



Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)