lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 27 Oct 2009 17:58:53 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Stephen Hemminger <shemminger@...tta.com>
cc:	Eric Dumazet <eric.dumazet@...il.com>,
	Stephen Hemminger <stephen.hemminger@...tta.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Octavian Purdila <opurdila@...acom.com>,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
	Al Viro <viro@...iv.linux.org.uk>
Subject: Re: [PATCH] dcache: better name hash function



On Tue, 27 Oct 2009, Stephen Hemminger wrote:
> 
> Agreed. Here is the reduced version of the program.
> To run:
>   find /home -printf '%f\n' 2>/dev/null | ./htest -n 100

The timings are very sensitive to random I$ layout at least on Nehalem. 
The reason seems to be that the inner loop is _so_ tight that just 
depending on exactly where the loop ends up, you can get subtle 
interactions with the loop cache.

Look here:

	[torvalds@...alem ~]$ find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
	Algorithm             Time       Ratio       Max   StdDev
	full_name_hash       1.141899     1.03      4868 263.37
	djb2                 0.980200     1.03      4835 266.05
	string10             0.909175     1.03      4850 262.67
	string10a            0.673915     1.03      4850 262.67
	string10b            0.909374     1.03      4850 262.67
	string_hash17        0.966050     1.03      4805 263.68
	string_hash31        1.008544     1.03      4807 259.37
	fnv32                0.774806     1.03      4817 259.17

what do you think the difference between 'string10', 'string10a' and 
'string10b' are?

None. None what-so-ever. The source code is identical, and gcc generates 
identical assembly language. Yet those timings are extremely stable for 
me, and 'string10b' is 25% faster than the identical string10 and 
string10a functions.

The only difference? 'string10a' starts aligned to just 16 bytes, but that 
in turn happens to mean that the tight inner loop ends up aligned on a 
128-byte boundary. And being cacheline aligned just there seems to matters 
for some subtle micro-architectural reason.

The reason I noticed this is that I wondered what small modifications to 
'string10' would do for performance, and noticed that even _without_ the 
small modifications, performance fluctuated.

Lesson? Microbenchmarks like this can be dangerous and misleading. That's 
_especially_ true if the loop ends up being just tight enough that it can 
fit in some trace cache or similar. In real life, the name hash is 
performance-critical, but at the same time almost certainly won't be run 
in a tight enough loop that you'd ever notice things like that.

		Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ