Date:	Mon, 9 Jul 2007 12:26:23 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Davide Libenzi <davidel@...ilserver.org>
cc:	Nick Piggin <npiggin@...e.de>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: queued spinlock code and results



On Mon, 9 Jul 2007, Davide Libenzi wrote:
> 
> So in this box, and in this test, the double-short Z-lock seems faster 
> than a double-byte. I've no idea why, since it uses two ops more and an 
> extra register.

At this kind of level, the exact instruction scheduling can make a big 
difference.

The extra register usage won't matter if there is no register pressure, 
and any extra instructions can actually happen to *help*, if they end up 
aligning something just the right way.

There can also be various random effects of prefixes: decoding x86 
instructions is a very uarch-specific issue, and for all we know the AMD 
setup may well end up behaving differently from most Intel chips (and 
within the Intel family, the NetBurst situation is likely different from 
the P6-derived cores).

For example, does a single prefix decode faster? It could be that the 
combination of "lock" _and_ "opsize" prefixes is problematic (as in a 
16-bit "lock xaddw"), and causes a decode hiccup, but that "lock" and 
"opsize" on their own don't cause any decoder issues (ie doing the 
"lock" on the 32-bit xadd and just the "opsize" prefix on the 16-bit 
decw are both fast).

But on another uarch it might work out the other way: if "lock" is always 
a complex op, then having an opsize prefix on it might be "free", and 
then you're better off combining them for the locked 16-bit xadd, and 
having the releasing "decb" carry no prefix at all.
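
Concretely, the four encodings in question look something like this (a 
sketch only - not a working lock, it just shows which prefix bytes each 
instruction hands the decoders):

	/* Not a working lock: just the four encodings and their prefixes. */
	unsigned int   l32;
	unsigned short w16;
	unsigned char  b8;

	void prefix_demo(void)
	{
		unsigned int   tl = 1;
		unsigned short tw = 1;

		/* "lock" only: f0 0f c1 /r */
		asm volatile("lock; xaddl %0, %1" : "+r" (tl), "+m" (l32) : : "memory");
		/* "lock" + "opsize": f0 66 0f c1 /r */
		asm volatile("lock; xaddw %0, %1" : "+r" (tw), "+m" (w16) : : "memory");
		/* "opsize" only: 66 ff /1 */
		asm volatile("decw %0" : "+m" (w16) : : "memory");
		/* no prefix at all: fe /1 */
		asm volatile("decb %0" : "+m" (b8) : : "memory");
	}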

And regardless of that, a random "it happened to get aligned that way" 
(where "alignment" might be about hitting the cache-line just right, but 
might also be about having the right instruction mix to keep the Intel 
decoders running at their full 4-1-1-1 capacity) could be causing the 
timing differences.

So before taking these numbers as any kind of "real" values, I'd suggest:

 - trying it out on at least a few different uarchs (Opteron, P4 and Core 
   2 all have quite different restrictions on decoding)

 - possibly trying it out with things in different order and different 
   compiler options (-O2 vs -Os), trying to provoke different kinds of 
   alignment issues (a rough harness for this is sketched below).
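
Something along these lines would do for the second point (a rough 
sketch, not your test - the PAD knob and the plain locked xadd stand-in 
are only there to make it easy to rebuild the same loop at different 
offsets):

	#include <stdio.h>
	#include <stdint.h>

	#ifndef PAD
	#define PAD 0		/* rebuild with -DPAD=1,2,... to shift the loop */
	#endif
	#define STR_(x)	#x
	#define STR(x)	STR_(x)
	#define ITERS	10000000

	static unsigned int lock_word;

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;
		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		uint64_t t0, t1;
		int i;

		/* Pad with nops so the hot loop lands at a different offset. */
		asm volatile(".rept " STR(PAD) "\n\tnop\n\t.endr");

		t0 = rdtsc();
		for (i = 0; i < ITERS; i++) {
			unsigned int t = 1;

			/* stand-in for the lock/unlock pair under test */
			asm volatile("lock; xaddl %0, %1"
				     : "+r" (t), "+m" (lock_word) : : "memory");
		}
		t1 = rdtsc();
		printf("PAD=%d: %.2f cycles/iter\n", PAD, (t1 - t0) / (double)ITERS);
		return 0;
	}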

Also, just a small nit: in the kernel, the locking would _not_ be inlined 
(but the unlocking would), so marking the lock functions "inline" is 
probably a bad idea. Without the inline it's likely more realistic, and 
the effects of register pressure will be hidden. Because the lock path 
ends up being a real function call, I think you can generally ignore the 
"one or two registers" issue - you'll have three caller-clobbered 
registers to play with regardless.
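
IOW, something like this (with a plain test-and-set as a stand-in for 
your queued lock):

	struct myspinlock {
		volatile unsigned int slock;
	};

	/*
	 * Out of line, like the kernel lock path: the call itself already
	 * clobbers %eax/%ecx/%edx, so the caller's register pressure looks
	 * realistic no matter what the lock body does.
	 */
	__attribute__((noinline)) void my_spin_lock(struct myspinlock *lock)
	{
		while (__sync_lock_test_and_set(&lock->slock, 1))
			while (lock->slock)
				asm volatile("rep; nop" : : : "memory");	/* pause */
	}

	/* The unlock is trivial, so it stays inline in the caller. */
	static inline void my_spin_unlock(struct myspinlock *lock)
	{
		__sync_lock_release(&lock->slock);
	}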

			Linus
