linux-kernel - Re: [PATCH 0/5] lib/find: optimize find

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wiFJboHXZEXtMhDcHbEYXujTiDxkGbpONC=DJA7dJG6nw@mail.gmail.com>
Date:   Thu, 28 Jul 2022 11:49:24 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Yury Norov <yury.norov@...il.com>
Cc:     Guenter Roeck <linux@...ck-us.net>,
        Dennis Zhou <dennis@...nel.org>,
        Russell King <linux@...linux.org.uk>,
        Catalin Marinas <catalin.marinas@....com>,
        Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
        Rasmus Villemoes <linux@...musvillemoes.dk>,
        Alexey Klimov <aklimov@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/5] lib/find: optimize find_bit() functions

On Thu, Jul 28, 2022 at 9:12 AM Yury Norov <yury.norov@...il.com> wrote:
>
> In the recent discussion [1], it was noticed that find_next_bit()
> functions may be improved by adding wrappers around common
> __find_next_bit().

So looking at the end result, this generates fairly good code.

I say "fairly good" because _find_next_and_bit() ends up being disgusting.

The reason? That

        if (addr2)
                tmp &= addr2[start / BITS_PER_LONG];

generates horrific code when 'addr2' isn't seen to be always NULL.

So this doesn't affect the regular _find_next_bit case, because the
inliner sees that addr2 is always NULL and doesn't generate extra code
for it.

But the code that actually *does* have two incoming bitmasks will now
pointlessly test that second bitmask pointer all the time.

Now, that particular function probably doesn't matter, but this code
seems to be unnecessarily written to try to be overly generic, and
that

 (a) makes it hard to read the source code because the source code
doesn't do the obvious thing

 (b) clearly generates suboptimal code too

so I'm really not a huge fan of this "try to share code". Not when the
resulting shared code is harder to read, and impossible for the
compiler to do a great job at.

And the code sharing really doesn't help all that much.

If you really want to generate good code, use macros, and share the
logic that way. Not hugely readable either, but I think it's actually
not bad.

I think something like this would work:

    #define BIT_FIND_SETUP(addr, size, start)                   \
        unsigned long val, idx;                                 \
        if (unlikely(start >= size))                            \
                return size;                                    \
        idx = start / BITS_PER_LONG;

    #define BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)  \
        if (!IS_ALIGNED(start, BITS_PER_LONG)) {                \
                unsigned long mask;                             \
                mask = BITMAP_FIRST_WORD_MASK(start);           \
                if ((val = mask & (EXPRESSION)) != 0)           \
                        goto found;                             \
                idx++;                                          \
        }

    #define BIT_WORD_LOOP(ptr, size, idx, val, EXPRESSION)              \
        do {                                                    \
                if ((val = (EXPRESSION)) != 0)                  \
                         goto found;                            \
                idx++;                                          \
        } while ((idx)*BITS_PER_LONG < (size));

    #define BIT_FIND_BODY(addr, size, start, EXPRESSION)                \
        BIT_FIND_SETUP(addr, size, start)                       \
        BIT_FIND_FIRST_WORD(addr, size, start, EXPRESSION)      \
        BIT_WORD_LOOP(addr, size, idx, val, EXPRESSION) \
        return size;                                            \
    found:      BIT_WORD_SWAB(val);                                     \
        return min((idx)*BITS_PER_LONG + __ffs(val), size)

    #define BIT_WORD_SWAB(x) /* Nothing */

and then for the BE versions you just use the same macros, but you
make BIT_WORD_SWAB() do the swab.

I'm attaching an UNTESTED version of lib/find_bit.c that works the
above way (note: it is based on your header file changes!)

It builds for me and seems to generate reasonable code, although I
notice that clang messes up the "__ffs()" inline asm and forces the
source into memory.

(Gcc doesn't do that, but gcc follows the code exactly and generates
"shl $6" instructions, while clang seems to figure out that it can
just add 64 instead)

                        Linus

View attachment "find_bit.c" of type "text/x-csrc" (3385 bytes)