Message-ID: <alpine.LFD.0.999.0707091212410.3412@woody.linux-foundation.org>
Date: Mon, 9 Jul 2007 12:26:23 -0700 (PDT)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Davide Libenzi <davidel@...ilserver.org>
cc: Nick Piggin <npiggin@...e.de>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: queued spinlock code and results
On Mon, 9 Jul 2007, Davide Libenzi wrote:
>
> So in this box, and in this test, the double-short Z-lock seems faster
> than a double-byte. I've no idea why, since it uses two ops more and an
> extra register.
At this kind of level, the exact instruction scheduling can make a big
difference.
The extra register usage won't matter if there is no register pressure,
and any extra instructions can actually happen to *help*, if they end up
aligning something just the right way.
There can also be various random effects of prefixes: decoding x86
instructions is a very uarch-specific issue, and for all we know the AMD
setup may well end up behaving differently from most Intel chips (and
within the Intel family, the Netburst situation is likely different from
the P6-derived cores).
For example, does a single prefix decode faster? It could be that the
combination of "lock" _and_ "opsize" prefixes is problematic (as in a
16-bit "lock xaddw"), and causes a decode hiccup, but that "lock" and
"opsize" on their own don't cause any decoder issues (ie doing the "lock"
on the 32-bit xadd and just the "opsize" prefix on the 16-bit decw are
both fast).
But on another uarch it might work out the other way: if "lock" is always
a complex op, then having an opsize prefix on that one might be "free", and
then you're better off combining them for the locked 16-bit xadd, and having
the releasing "decb" not have any prefix at all.
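Just to make it concrete where the prefixes land, here's a purely
illustrative user-space sketch of a ticket-style lock (made-up layout and
names, not the code being benchmarked): the first acquire puts "lock" and
"opsize" on the same xadd, the second keeps "lock" alone on a 32-bit xadd,
and the release carries only the "opsize" prefix:

	#include <stdint.h>

	/* illustrative only: little-endian, owner in the low half */
	struct tlock {
		volatile uint16_t owner;
		volatile uint16_t next;
	} __attribute__((aligned(4)));

	/* "lock" _and_ "opsize" on one instruction: lock xaddw */
	static void lock_xaddw(struct tlock *l)
	{
		uint16_t ticket = 1;

		asm volatile("lock; xaddw %0, %1"
			     : "+r" (ticket), "+m" (l->next) : : "memory");
		while (l->owner != ticket)
			asm volatile("rep; nop");	/* pause */
	}

	/* "lock" alone, on a 32-bit xadd that covers both halves */
	static void lock_xaddl(struct tlock *l)
	{
		uint32_t inc = 0x00010000;	/* bump only the "next" half */

		asm volatile("lock; xaddl %0, %1"
			     : "+r" (inc), "+m" (*(volatile uint32_t *)l)
			     : : "memory");
		while (l->owner != (uint16_t)(inc >> 16))
			asm volatile("rep; nop");
	}

	/* "opsize" alone, no "lock": only the lock holder writes this */
	static void unlock(struct tlock *l)
	{
		asm volatile("incw %0" : "+m" (l->owner) : : "memory");
	}

(A ticket lock releases by bumping the owner half, hence the "incw" here;
the exact instruction mix in the test code may well differ.)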
And regardless of that, a random "it happened to get aligned that way"
effect could be causing the timing differences (where "alignment" might be
about hitting the cache line just right, but might also be about having
the right instruction mix to get the Intel decoders to run at their full
4-1-1-1 capacity).
So before taking these numbers as any kind of "real" values, I'd suggest:
- trying it out on at least a few different uarchs (Opteron, P4 and Core
2 all have quite different restrictions on decoding)
- possibly trying it out with things in a different order and with
different compiler options (-O2 vs -Os), trying to cause different kinds
of alignment issues (one crude way to perturb alignment is sketched below).
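If alignment is the suspect, one crude way to perturb it between builds
(purely a debugging trick - the macro name and values here are made up) is
to drop a build-time-configurable run of NOPs in front of the measured
sequence and rebuild with different -DPAD_BYTES values, on top of playing
with -falign-functions/-falign-loops:

	#ifndef PAD_BYTES
	#define PAD_BYTES 0
	#endif
	#define STR_(x) #x
	#define STR(x) STR_(x)

	static inline void align_perturb(void)
	{
		/* emit PAD_BYTES one-byte NOPs (0x90) at this spot */
		asm volatile(".skip " STR(PAD_BYTES) ", 0x90");
	}

Call it right before the lock/unlock loop being timed; if a byte or two of
padding moves the numbers around, you're looking at alignment, not the
lock sequence itself.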
Also, just a small nit: in the kernel, the locking would _not_ be inlined
(but the unlocking would), so marking the lock functions "inline" is
probably a bad idea. Without the inline, the test is likely more realistic,
and the effects of register pressure will be hidden. Because the lock
functions end up out of line, I think you can generally ignore the "one or
two registers" issue - you'll have three caller-clobbered registers to
play with regardless.
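Concretely, something along these lines (again purely illustrative, and
not the kernel's actual spinlock API) is closer to what the kernel would
generate - the lock path is a real call, so eax/ecx/edx are clobbered no
matter how few registers its body needs, and only the one-instruction
unlock is worth inlining:

	struct tlock { volatile unsigned short owner, next; };

	/* not inlined, as the kernel's lock functions would be */
	__attribute__((noinline)) void tlock_acquire(struct tlock *l)
	{
		unsigned short ticket = 1;

		asm volatile("lock; xaddw %0, %1"
			     : "+r" (ticket), "+m" (l->next) : : "memory");
		while (l->owner != ticket)
			asm volatile("rep; nop");
	}

	/* trivially small, so inlining the unlock is fine */
	static inline void tlock_release(struct tlock *l)
	{
		asm volatile("incw %0" : "+m" (l->owner) : : "memory");
	}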
Linus