Message-ID: <20160421161354.GI3430@twins.programming.kicks-ass.net>
Date: Thu, 21 Apr 2016 18:13:54 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Pan Xinhui <xinhui@...ux.vnet.ibm.com>
Cc: linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
benh@...nel.crashing.org, paulus@...ba.org, mpe@...erman.id.au,
boqun.feng@...il.com, paulmck@...ux.vnet.ibm.com,
tglx@...utronix.de
Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16
On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> yes, you are right. more load/store will be done in C code.
> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any performance regression.
> So just wrote in C, for simple. :)
Which is fine; but worthy of a note in your Changelog.
> Of course I have done xchg tests.
> we run code just like xchg((u8*)&v, j++); in several threads.
> and the result is,
> [ 768.374264] use time[1550072]ns in xchg_u8_asm
> [ 768.377102] use time[2826802]ns in xchg_u8_c
>
> I think this is because there is one more load in C.
> If possible, we can move such code in asm-generic/.
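[For readers following along: a generic C byte-sized xchg built on top of a word-sized compare-and-swap might look roughly like the sketch below. This is illustrative only, not the actual powerpc or asm-generic code; the helper name is made up, and the byte-offset math assumes a little-endian layout. The plain load before the loop is the "one more load" referred to above.]

```c
#include <stdint.h>
#include <stdatomic.h>

/*
 * Illustrative sketch: exchange one byte by doing a CAS loop on the
 * aligned 32-bit word that contains it.  The initial atomic_load is a
 * plain (non-exclusive) load -- the extra load discussed in this
 * thread -- and it widens the window in which the word can change
 * under us, forcing the CAS to retry.
 *
 * Assumes a little-endian byte order for the shift computation.
 */
static uint8_t xchg_u8_generic(uint8_t *p, uint8_t val)
{
	uintptr_t addr = (uintptr_t)p;
	_Atomic uint32_t *word = (_Atomic uint32_t *)(addr & ~(uintptr_t)3);
	unsigned int shift = (unsigned int)(addr & 3) * 8;
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t old, new;

	/* Plain load outside the atomic sequence. */
	old = atomic_load_explicit(word, memory_order_relaxed);
	do {
		new = (old & ~mask) | ((uint32_t)val << shift);
	} while (!atomic_compare_exchange_weak_explicit(word, &old, new,
			memory_order_acq_rel, memory_order_relaxed));

	return (uint8_t)((old & mask) >> shift);
}
```

By contrast, a native LL/SC implementation would perform the byte insertion between the load-reserve and store-conditional themselves, so the only load is the exclusive one.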
So I'm not actually _that_ familiar with the PPC LL/SC implementation;
but there are things a CPU can do to optimize these loops.
For example, a CPU might choose not to release its exclusive hold on the
cache line for a number of cycles, until the SC completes or an interrupt
happens. That way there is a smaller chance the SC fails and inhibits
forward progress.
By doing the modification outside of the LL/SC you lose such
advantages.
And yes, doing a !exclusive load prior to the exclusive load leads to an
even bigger window where the data can get changed out from under you.