linux-kernel - Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1f8d7eaf-ac3b-81b8-0d8f-12f60436cc48@linux.vnet.ibm.com>
Date:   Mon, 13 Feb 2017 10:24:38 +0800
From:   panxinhui <xinhui@...ux.vnet.ibm.com>
To:     Waiman Long <longman@...hat.com>, Xinhui Pan <mnipxh@...il.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve
 performance on some archs



在 2017/2/10 上午4:53, Waiman Long 写道:
> On 02/07/2017 10:39 PM, Xinhui Pan wrote:
>>
>>
>> 2016-12-26 4:26 GMT+08:00 Waiman Long <longman@...hat.com <mailto:longman@...hat.com>>:
>>
>>     A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
>>     relaxed versions to improve performance on architectures that use LL/SC.
>>
>>     All the locking related cmpxchg's are replaced with the _acquire
>>     variants:
>>      - pv_queued_spin_steal_lock()
>>      - trylock_clear_pending()
>>
>>     The cmpxchg's related to hashing are replaced by either by the _release
>>     or the _relaxed variants. See the inline comment for details.
>>
>>     Signed-off-by: Waiman Long <longman@...hat.com <mailto:longman@...hat.com>>
>>
>>      v1->v2:
>>       - Add comments in changelog and code for the rationale of the change.
>>
>>     ---
>>      kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++-------------
>>      1 file changed, 33 insertions(+), 17 deletions(-)
>>
>>
>>     @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>                      * If pv_kick_node() changed us to vcpu_hashed, retain that
>>                      * value so that pv_wait_head_or_lock() knows to not also try
>>                      * to hash this lock.
>>     +                *
>>     +                * The smp_store_mb() and control dependency above will ensure
>>     +                * that state change won't happen before that. Synchronizing
>>     +                * with pv_kick_node() wrt hashing by this waiter or by the
>>     +                * lock holder is done solely by the state variable. There is
>>     +                * no other ordering requirement.
>>                      */
>>     -               cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>     +               cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>
>>                     /*
>>                      * If the locked flag is still not set after wakeup, it is a
>>     @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>              * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>>              * observe its next->locked value and advance itself.
>>              *
>>     -        * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>     +        * Matches with smp_store_mb() and cmpxchg_relaxed() in pv_wait_node().
>>     +        * A release barrier is used here to ensure that node->locked is
>>     +        * always set before changing the state. See comment in pv_wait_node().
>>              */
>>     -       if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>     +       if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>     +                       != vcpu_halted)
>>                     return;
>>
>> hi, Waiman
>> We can't use _release here, a full barrier is needed.
>>
>> There is pv_kick_node vs pv_wait_head_or_lock
>>
>> [w] l->locked = _Q_SLOW_VAL  //reordered here    
>>                                                                                 if (READ_ONCE(pn->state) == vcpu_hashed) //False.
>>                                                                                                lp = (struct qspinlock **)1;
>>
>> [STORE] pn->state = vcpu_hashed                        lp = pv_hash(lock, pn);
>> pv_hash()                                                                if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.
>>
>> Then the same lock has hashed twice but only unhashed once. So at last as the hash table grows big, we hit RCU stall.
>>
>> I hit RCU stall when I run netperf benchmark
>>
>> thanks
>> xinhui
>>
>>
>>     --
>>     1.8.3.1
>>
>>
> Yes, I know I am being too aggressive in this patch. I am going to tone it down a bit. I just don't have time to run a performance test on PPC system to verify the gain yet. I am planning to send an updated patch soon.
> 
hi, All

I guess I have found the scenario that causes the RCU stall.

pv_wait_node					
										[L] pn->state // this load is reordered from cmxchg_release.
	smp_store_mb(pn->state, vcpu_halted);					
	if (!READ_ONCE(node->locked))						
										arch_mcs_spin_unlock_contended(&next->locked);
										pv_kick_node
											[-L]cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
											//cmpxchg_release fails, so pn->state keep as it is.
		pv_wait(&pn->state, vcpu_halted);
		//on PPC, It will not return until pn->state != vcpu_halted.
		
And when rcu stall hit, I fire an BUG(), and enter debug mode, it seems most cpus are in pv_wait...

So the soltuion to solve this problems is simple, keep the cmpxchg as it is in pv_kick_node, cmpxchg on ppc provides full barriers.

thanks
xinhui

> Cheers,
> Longman
>