[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wgga5hrg5qgm4UwuOCjgBrobhZcdKTr1AFU7WSWgVKKZQ@mail.gmail.com>
Date: Thu, 28 Jul 2022 14:49:24 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Nick Desaulniers <ndesaulniers@...gle.com>,
Nathan Chancellor <nathan@...nel.org>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/5] lib/find: optimize find_bit() functions
On Thu, Jul 28, 2022 at 11:49 AM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> It builds for me and seems to generate reasonable code, although I
> notice that clang messes up the "__ffs()" inline asm and forces the
> source into memory.
I have created a llvm issue for this at
https://github.com/llvm/llvm-project/issues/56789
and while I noticed this while looking at the rather odd code
generation for the bit finding functions, it seems to be a general
issue with clang inline asm.
It looks like any instruction that takes a mod/rm input (so a register
or memory) will always force the thing to be in memory. Which is very
pointless in itself, but it actually causes some functions to have a
stack frame that they wouldn't otherwise need or want. So it actually
has secondary downsides too.
And yes, that particular case could be solved with __builtin_ctzl(),
which seems to DTRT. But that uses plain bsf, and we seem to really
want tzcnt ("rep bsf") here, although I didn't check why (the comment
explicitly says "Undefined if no bit exists", which is the main
difference between bsf and tzcnt).
I _think_ it's because tzcnt is faster when it exists exactly because
it always writes the destination, so 'bsf' is actually the inferior
op, and clang shouldn't generate it.
But the "rm" thing exists elsewhere too, and I just checked - this
same issue seems to happen with "g" too (ie "any general integer
input").
Linus
Powered by blists - more mailing lists