linux-kernel - Re: Word-at-a-time dcache name accesses (was Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Sun, 4 Mar 2012 21:38:31 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Jason Garrett-Glaser <jason@...4.com>
Cc:	Andi Kleen <andi@...stfloor.org>, "H. Peter Anvin" <hpa@...or.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	Al Viro <viro@...iv.linux.org.uk>
Subject: Re: Word-at-a-time dcache name accesses (was Re: .. anybody know of
 any filesystems that depend on the exact VFS 'namehash' implementation?)

On Sun, Mar 4, 2012 at 7:58 PM, Jason Garrett-Glaser <jason@...4.com> wrote:
>
> There is an improvement you can make to this.  "bsf" is microcoded in
> many future CPUs (e.g. Piledriver) in favor of tzcnt, which has
> slightly different flag behavior and no undefined behavior and is part
> of BMI1.

So I've gotten rid of 'bsf' because it really does have problems on
many CPU's. It's disgustingly slow on some older CPU's.

I asked around on G+ to see if that would be useful, and there's a
nice simple four-instruction sequence for the 32-bit case using just
trivial operations (one shift, one and, a couple of adds).

For the 64-bit case, the bsf can be replaced with a single multiply
and shift. The bsf is still better on some CPU's, but the single
multiply and shift is more consistently good - and as long as it
doesn't stall the CPU, we're good, because the end result of it all
won't be used until several cycles later.

So my current patch is attached - it does depend on the current -git
tree having moved dentry_cmp() into fs/dcache.c, so it's on top of
*tonights* -git tree, but this is something I'm pretty happy with, and
was planning on actually committing early in the 3.4 merge window.

My profiling seems to show that the multiply is pretty much free on
64-bit at least on the cpu's I have access to - it's not like a
multiply is free, but I do suspect it gets hidden very well by any OoO
instruction scheduling.

A bit-count instructions (popcount or bsf or tzcnt is obviously in
*theory* less work than a 64-bit multiply, but the multiply is
"portable". Even if it isn't optimal, it shouldn't be horrible on any
64-bit capable x86 CPU, and it also means (for example) that the code
might even work on non-x86 chips.

I did only very limited profiling of the 32-bit case, but it's really
just four cheap ALU instructions there and didn't really show up at
all in the limited profiles I did. And at least I checked that the
code worked. I have to say that the advantage of "vectorizing" this
code is obviously much less if you can only do 4-byte "vectors", so I
didn't actually time whether the patch *improves* anything on x86-32.

                                     Linus

View attachment "patch.diff" of type "text/x-patch" (5700 bytes)