linux-kernel - Re: Word-at-a-time dcache name accesses (was Re: .. anybody know of any filesystems that depend on the exact VFS 'namehash' implementation?)


Message-ID: <CA+55aFyMn+gYh2uRPJSAS5n4UbQoY2iqe3peYjVosrP-73oQVA@mail.gmail.com>
Date:	Sat, 3 Mar 2012 12:10:09 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Andi Kleen <andi@...stfloor.org>, "H. Peter Anvin" <hpa@...or.com>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	Al Viro <viro@...iv.linux.org.uk>
Subject: Re: Word-at-a-time dcache name accesses (was Re: .. anybody know of
 any filesystems that depend on the exact VFS 'namehash' implementation?)

On Fri, Mar 2, 2012 at 3:46 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> This *does* assume that "bsf" is a reasonably fast instruction, which is
> not necessarily the case especially on 32-bit x86. So the config option
> choice for this might want some tuning even on x86, but it would be lovely
> to get comments and have people test it out on older hardware.

Ok, so I was thinking about this. I can replace the "bsf" with a
multiply, and I just wonder which one is faster.

> +       /* Get the final path component length */
> +       len += __ffs(mask) >> 3;
> +
> +       /* The mask *below* the first high bit set */
> +       mask = (mask - 1) & ~mask;
> +       mask >>= 7;
> +       hash += a & mask;

So instead of the __ffs() on the original mask (to find the first byte
with the high bit set), I could use the "mask of bytes" and some math
to get the number of bytes set like this (so this goes at the end,
*after* we used the mask to mask off the bytes in 'a' - not where the
__ffs() is right now):

    /* Low bits set in each byte we used as a mask */
    mask &= ONEBYTES;

    /* Add up "mask + (mask<<8) + (mask<<16) +... ":
       same as a multiply */
    mask *= ONEBYTES;

    /* High byte now contains count of bits set */
    len += mask >> 8*(sizeof(unsigned long)-1);

which I find intriguing because it just continues with the whole
"bitmask tricks" thing and even happens to re-use one of the bitmasks
we already had.

On machines with slow bit scanning (and a good multiplier), that might
be faster.
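
A minimal user-space sketch of the two variants, for comparison. It is not the kernel patch itself: the names zero_byte_mask(), len_by_ffs() and len_by_multiply() are made up for illustration, the mask here only detects zero bytes (the real code would presumably also have to stop at '/'), and __builtin_ctzl() (a GCC/Clang builtin) stands in for the kernel's __ffs(). Little-endian byte order is assumed, and the mask passed in must be non-zero.

```c
#include <assert.h>

/* 0x01 and 0x80 replicated into every byte of an unsigned long */
#define ONEBYTES (~0UL / 0xff)
#define HIGHBITS (ONEBYTES * 0x80)

/*
 * Hypothetical helper: the lowest set bit of the result is the high bit
 * of the first zero byte in 'a'. Bits above it may be spurious because
 * of borrow propagation, so only the lowest set bit is relied upon.
 */
static inline unsigned long zero_byte_mask(unsigned long a)
{
	return (a - ONEBYTES) & ~a & HIGHBITS;
}

/* Variant 1: bit scan, then convert the bit index to a byte index */
static inline unsigned int len_by_ffs(unsigned long mask)
{
	return (unsigned int)__builtin_ctzl(mask) >> 3;
}

/* Variant 2: build the "mask of bytes", then sum it with a multiply */
static inline unsigned int len_by_multiply(unsigned long mask)
{
	/* All bits below the first high bit set */
	mask = (mask - 1) & ~mask;
	/* 0xff in every byte that comes before the terminator */
	mask >>= 7;
	/* 0x01 in each of those bytes */
	mask &= ONEBYTES;
	/* mask + (mask<<8) + (mask<<16) + ...: top byte is the byte count */
	mask *= ONEBYTES;
	return (unsigned int)(mask >> 8 * (sizeof(unsigned long) - 1));
}

/* Both variants should agree whenever 'a' contains a zero byte */
static inline void check_agree(unsigned long a)
{
	unsigned long mask = zero_byte_mask(a);

	assert(mask != 0);
	assert(len_by_ffs(mask) == len_by_multiply(mask));
}
```

For example, check_agree(0x00636261UL) ("abc" followed by a NUL, in little-endian order) passes, with both variants giving a length of 3. The per-byte partial sums never exceed sizeof(unsigned long), so no carry leaks between bytes during the multiply.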

Sadly, it's a multiply with a big constant. Yes, we could make the
constant smaller by not counting the highest byte: it is never set, so
we could use "ONEBYTES>>8" and shift right by "8*(sizeof(unsigned
long)-2)" instead, but it's still not as cheap as just doing adds and
masks.
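
A rough sketch of that smaller-constant variant next to the full one, under the same assumptions as the snippet in the mail (the input is the "mask of bytes": 0xff in each of the low k bytes, top byte clear). The function names count_full() and count_small() are hypothetical. One detail worth hedging on: with "ONEBYTES>>8" the byte above the one holding the count appears to pick up partial sums as well, so this sketch masks the result with 0xff, which is one more operation than the full-constant version needs.

```c
#include <assert.h>

/* 0x01 replicated into every byte of an unsigned long */
#define ONEBYTES (~0UL / 0xff)

/* Full constant: the count ends up alone in the top byte */
static inline unsigned int count_full(unsigned long bytemask)
{
	unsigned long m = bytemask & ONEBYTES;

	m *= ONEBYTES;
	return (unsigned int)(m >> 8 * (sizeof(unsigned long) - 1));
}

/*
 * Smaller constant: the count ends up in the second-highest byte.
 * The top byte is not guaranteed to be zero after the multiply,
 * hence the extra "& 0xff" (an addition of this sketch).
 */
static inline unsigned int count_small(unsigned long bytemask)
{
	unsigned long m = bytemask & ONEBYTES;

	m *= ONEBYTES >> 8;
	return (unsigned int)(m >> 8 * (sizeof(unsigned long) - 2)) & 0xff;
}
```

With a two-byte mask (0xffff) both functions return 2, but dropping the "& 0xff" from count_small() would yield 0x102 instead.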

I can't come up with anything really cheap to calculate "number of
bytes set". But the above may be cheaper than the bsf on some older
32-bit machines that have horrible bit scanning performance (eg Atom
or P4). An integer multiply tends to be around four cycles, the bsf
performance is all over the map (2-17 cycles latency).

                         Linus