Message-ID: <20140707143357.GR6758@twins.programming.kicks-ass.net>
Date: Mon, 7 Jul 2014 16:33:57 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc: Waiman.Long@...com, tglx@...utronix.de, mingo@...nel.org,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
virtualization@...ts.linux-foundation.org,
xen-devel@...ts.xenproject.org, kvm@...r.kernel.org,
paolo.bonzini@...il.com, boris.ostrovsky@...cle.com,
paulmck@...ux.vnet.ibm.com, riel@...hat.com,
torvalds@...ux-foundation.org, raghavendra.kt@...ux.vnet.ibm.com,
david.vrabel@...rix.com, oleg@...hat.com, gleb@...hat.com,
scott.norton@...com, chegu_vinod@...com
Subject: Re: [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS
On Wed, Jun 18, 2014 at 11:57:30AM -0400, Konrad Rzeszutek Wilk wrote:
> On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote:
> > From: Peter Zijlstra <peterz@...radead.org>
> >
> > When we allow for a max NR_CPUS < 2^14 we can optimize the pending
> > wait-acquire and the xchg_tail() operations.
> >
> > By growing the pending bit to a byte, we reduce the tail to 16 bits.
> > This means we can use xchg16 for the tail part and do away with all
> > the repeated cmpxchg() operations.
> >
> > This in turn allows us to unconditionally acquire; the locked state
> > as observed by the wait loops cannot change. And because both locked
> > and pending are now a full byte we can use simple stores for the
> > state transition, obviating one atomic operation entirely.
>
> I have to ask - how much more performance do you get from this?
>
> Is this extra atomic operation hurting that much?
It's not extra; it's a cmpxchg() loop vs. an unconditional xchg().
And yes, it's somewhat tedious to show, but on 4-socket systems you
can really see it make a difference. I'll try to run some numbers; I
need to reinstall the box.
(there were numbers in the previous threads, but you're right, I
should've put some in the Changelog).
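Roughly, the difference looks like this (a sketch following the naming
used in this series, not necessarily the exact patch code). With
NR_CPUS >= 2^14 the tail shares the 32-bit lock word with the locked
and pending bits, so xchg_tail() has to recompute the whole word and
retry on contention:

	/*
	 * Generic version: the tail cannot be exchanged on its own, so
	 * loop until the cmpxchg() succeeds.
	 */
	static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
	{
		u32 old, new, val = atomic_read(&lock->val);

		for (;;) {
			new = (val & _Q_LOCKED_PENDING_MASK) | tail;
			old = atomic_cmpxchg(&lock->val, val, new);
			if (old == val)
				break;
			val = old;
		}
		return old;
	}

With NR_CPUS < 2^14 the tail occupies its own 16-bit half-word
(struct __qspinlock overlays the lock word to make it addressable),
so one unconditional xchg() suffices:

	/*
	 * Optimized version: tail lives in bits 16..31; exchange just
	 * that half-word, with no retry loop and no failure path.
	 */
	static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
	{
		struct __qspinlock *l = (void *)lock;

		return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
	}

The cmpxchg() can fail and retry indefinitely under contention; the
xchg() always completes in one shot, which is why the difference shows
up on big multi-socket boxes.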
> > /**
> > * queue_spin_lock_slowpath - acquire the queue spinlock
> > @@ -173,8 +259,13 @@ void queue_spin_lock_slowpath(struct qsp
> > * we're pending, wait for the owner to go away.
> > *
> > * *,1,1 -> *,1,0
> > + *
> > + * this wait loop must be a load-acquire such that we match the
> > + * store-release that clears the locked bit and create lock
> > + * sequentiality; this is because not all clear_pending_set_locked()
> > + * implementations imply full barriers.
> > */
> > - while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
> > + while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
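(For context: the "simple stores" from the changelog are exactly what
that comment is about. clear_pending_set_locked() becomes a plain
store with no implied barrier, hence the wait loop must be a
load-acquire pairing with the unlock path's store-release. A sketch,
again following the series' naming:)

	/*
	 * With NR_CPUS < 2^14, locked and pending are adjacent full
	 * bytes, addressable together as a 16-bit half-word.
	 */
	struct __qspinlock {
		union {
			atomic_t val;
			struct {
	#ifdef __LITTLE_ENDIAN
				u16	locked_pending;
				u16	tail;
	#else
				u16	tail;
				u16	locked_pending;
	#endif
			};
		};
	};

	/*
	 * *,1,0 -> *,0,1: clear pending, set locked. A plain store is
	 * enough because we own the pending bit and the locked byte is
	 * known to be zero; only the tail can change concurrently, and
	 * it lives in the other half-word.
	 */
	static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
	{
		struct __qspinlock *l = (void *)lock;

		ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
	}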
>
> lock->val.counter? Ugh, all to deal with the 'int' -> 'u32' (or 'u64')
No, to do atomic_t -> int.
> Could you introduce a macro in atomic.h called 'atomic_read_raw' which
> would do this? Like this:
That would be worse, I think. It looks like a function returning an
rvalue, whereas we really want an lvalue.
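Concretely (illustrative only; atomic_read_raw() is the hypothetical
name from the suggestion above, not an existing atomic.h macro):

	/* Proposed accessor: expands to an lvalue, but reads like a
	 * function call. */
	#define atomic_read_raw(v)	((v)->counter)

	/* The wait loop would then have to take its address: */
	val = smp_load_acquire(&atomic_read_raw(&lock->val));

	/* That compiles, since the macro expands to an lvalue, but it
	 * looks like taking the address of a function's return value,
	 * i.e. an rvalue. Spelling out the lvalue is clearer: */
	val = smp_load_acquire(&lock->val.counter);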