Date:	Mon, 9 Jul 2007 12:26:23 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Davide Libenzi <davidel@...ilserver.org>
cc:	Nick Piggin <npiggin@...e.de>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: queued spinlock code and results



On Mon, 9 Jul 2007, Davide Libenzi wrote:
> 
> So in this box, and in this test, the double-short Z-lock seems faster 
> than a double-byte. I've no idea why, since it uses two ops more and an 
> extra register.

At this kind of level, the exact instruction scheduling can make a big 
difference.

The extra register usage won't matter if there is no register pressure, 
and any extra instructions can actually happen to *help*, if they end up 
aligning something just the right way.

There can also be various random effects of prefixes: decoding x86 
instructions is a very uarch-specific issue, and for all we know the AMD 
setup may well end up behaving differently from most Intel chips (and 
within the Intel family, the NetBurst situation is likely different from 
the P6-derived cores).

For example, does a single prefix decode faster? It could be that the 
combination of "lock" _and_ "opsize" prefixes is problematic (as in a 
16-bit "lock xaddw"), and causes a decode hiccup, but that "lock" and 
"opsize" on their own don't cause any decoder issues (ie doing the 
"lock" on the 32-bit xadd and just the "opsize" prefix on the 16-bit 
decw are both fast).

But on another uarch it might work out the other way: if "lock" is always 
a complex op, then having an opsize prefix on it might be "free", and 
then you're better off combining them for the locked 16-bit xadd, and 
having the releasing "decb" carry no prefix at all.
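
Concretely, the four encodings in question look something like this (a 
sketch only - not a working lock, it just shows which prefix bytes each 
instruction hands the decoders):

	/* Not a working lock: just the four encodings and their prefixes. */
	unsigned int   l32;
	unsigned short w16;
	unsigned char  b8;

	void prefix_demo(void)
	{
		unsigned int   tl = 1;
		unsigned short tw = 1;

		/* "lock" only: f0 0f c1 /r */
		asm volatile("lock; xaddl %0, %1" : "+r" (tl), "+m" (l32) : : "memory");
		/* "lock" + "opsize": f0 66 0f c1 /r */
		asm volatile("lock; xaddw %0, %1" : "+r" (tw), "+m" (w16) : : "memory");
		/* "opsize" only: 66 ff /1 */
		asm volatile("decw %0" : "+m" (w16) : : "memory");
		/* no prefix at all: fe /1 */
		asm volatile("decb %0" : "+m" (b8) : : "memory");
	}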

And regardless of that, a random "it happened to get aligned that way" 
(where "alignment" might be about hitting the cache-line just right, but 
might also be about having the right instruction mix to keep the Intel 
decoders running at their full 4-1-1-1 capacity) could be causing the 
timing differences.

So before taking these numbers as any kind of "real" values, I'd suggest:

 - trying it out on at least a few different uarchs (Opteron, P4 and Core 
   2 all have quite different restrictions on decoding)

 - possibly trying it out with things in different order and different 
   compiler options (-O2 vs -Os), trying to provoke different kinds of 
   alignment issues (a rough harness for this is sketched below).
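
Something along these lines would do for the second point (a rough 
sketch, not your test - the PAD knob and the plain locked xadd stand-in 
are only there to make it easy to rebuild the same loop at different 
offsets):

	#include <stdio.h>
	#include <stdint.h>

	#ifndef PAD
	#define PAD 0		/* rebuild with -DPAD=1,2,... to shift the loop */
	#endif
	#define STR_(x)	#x
	#define STR(x)	STR_(x)
	#define ITERS	10000000

	static unsigned int lock_word;

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;
		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		uint64_t t0, t1;
		int i;

		/* Pad with nops so the hot loop lands at a different offset. */
		asm volatile(".rept " STR(PAD) "\n\tnop\n\t.endr");

		t0 = rdtsc();
		for (i = 0; i < ITERS; i++) {
			unsigned int t = 1;

			/* stand-in for the lock/unlock pair under test */
			asm volatile("lock; xaddl %0, %1"
				     : "+r" (t), "+m" (lock_word) : : "memory");
		}
		t1 = rdtsc();
		printf("PAD=%d: %.2f cycles/iter\n", PAD, (t1 - t0) / (double)ITERS);
		return 0;
	}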

Also, just a small nit: in the kernel, the locking would _not_ be inlined 
(but the unlocking would), so marking the lock functions "inline" is 
probably a bad idea. Without the inline it's likely more realistic, and 
the effects of register pressure will be hidden. Because the lock path 
ends up being a real function call, I think you can generally ignore the 
"one or two registers" issue - you'll have three caller-clobbered 
registers to play with regardless.
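
IOW, something like this (with a plain test-and-set as a stand-in for 
your queued lock):

	struct myspinlock {
		volatile unsigned int slock;
	};

	/*
	 * Out of line, like the kernel lock path: the call itself already
	 * clobbers %eax/%ecx/%edx, so the caller's register pressure looks
	 * realistic no matter what the lock body does.
	 */
	__attribute__((noinline)) void my_spin_lock(struct myspinlock *lock)
	{
		while (__sync_lock_test_and_set(&lock->slock, 1))
			while (lock->slock)
				asm volatile("rep; nop" : : : "memory");	/* pause */
	}

	/* The unlock is trivial, so it stays inline in the caller. */
	static inline void my_spin_unlock(struct myspinlock *lock)
	{
		__sync_lock_release(&lock->slock);
	}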

			Linus
