lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 1 Feb 2021 23:02:00 -0800
From:   Yury Norov <yury.norov@...il.com>
To:     Andy Shevchenko <andriy.shevchenko@...ux.intel.com>
Cc:     David Laight <David.Laight@...lab.com>,
        "linux-m68k@...ts.linux-m68k.org" <linux-m68k@...ts.linux-m68k.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-sh@...r.kernel.org" <linux-sh@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        Geert Uytterhoeven <geert@...ux-m68k.org>,
        Yoshinori Sato <ysato@...rs.sourceforge.jp>,
        Rich Felker <dalias@...c.org>, Arnd Bergmann <arnd@...db.de>,
        Dennis Zhou <dennis@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Wolfram Sang <wsa+renesas@...g-engineering.com>,
        David Sterba <dsterba@...e.com>,
        Stefano Brivio <sbrivio@...hat.com>,
        "Ma, Jianpeng" <jianpeng.ma@...el.com>,
        Wei Yang <richard.weiyang@...ux.alibaba.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
        Joe Perches <joe@...ches.com>,
        Alexey Klimov <aklimov@...hat.com>,
        Rasmus Villemoes <linux@...musvillemoes.dk>
Subject: Re: [PATCH 7/8] lib: add fast path for find_next_*_bit()

[ CC Alexey Klimov, andRasmus Villemoes as they also may be interested ]

On Mon, Feb 1, 2021 at 8:22 AM Andy Shevchenko
<andriy.shevchenko@...ux.intel.com> wrote:

> On Mon, Feb 01, 2021 at 04:02:30PM +0000, David Laight wrote:
> > From: Andy Shevchenko
> > > Sent: 01 February 2021 13:49
> > > On Sat, Jan 30, 2021 at 11:17:18AM -0800, Yury Norov wrote:
> > > > Similarly to bitmap functions, find_next_*_bit() users will benefit
> > > > if we'll handle a case of bitmaps that fit into a single word. In the
> > > > very best case, the compiler may replace a function call with a
> > > > single ffs or ffz instruction.
> > >
> > > Would be nice to have the examples how it reduces the actual code size (based
> > > on the existing code in kernel, especially in widely used frameworks /
> > > subsystems, like PCI).
>>
> > I bet it makes the kernel bigger but very slightly faster.
> > But the fact that the wrappers end up in the i-cache may
> > mean that inlining actually makes it slower for some calling
> > sequences.
>
> > If a bitmap fits in a single word (as a compile-time constant)
> > then you should (probably) be using different functions if
> > you care about performance.
>
> Isn't this patch series exactly about it?

Probably, David meant something like find_first_bit_fast_path().
Not sure that this approach would be better than pure inlining.

Addressing questions above.

My arm64 build for qemu:
Before:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .head.text    00010000  ffff800010000000  ffff800010000000  00010000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         011603a8  ffff800010010000  ffff800010010000  00020000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .got.plt      00000018  ffff8000111703a8  ffff8000111703a8  011803a8  2**3
                  CONTENTS, ALLOC, LOAD, DATA
  3 .rodata       007a29b2  ffff800011180000  ffff800011180000  01190000  2**12
                  CONTENTS, ALLOC, LOAD, DATA
  4 .pci_fixup    000025f0  ffff8000119229c0  ffff8000119229c0  019329c0  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 __ksymtab     000100d4  ffff800011924fb0  ffff800011924fb0  01934fb0  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 __ksymtab_gpl 000147cc  ffff800011935084  ffff800011935084  01945084  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 __ksymtab_strings 0003b0be  ffff800011949850  ffff800011949850
01959850  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 __param       00003e58  ffff800011984910  ffff800011984910  01994910  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __modver      00000678  ffff800011988768  ffff800011988768  01998768  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 10 __ex_table    00002270  ffff800011988de0  ffff800011988de0  01998de0  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 .notes        0000003c  ffff80001198b050  ffff80001198b050  0199b050  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .init.text    0007b09c  ffff8000119a0000  ffff8000119a0000  019a0000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .exit.text    00004120  ffff800011a1b09c  ffff800011a1b09c  01a1b09c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
...

After:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .head.text    00010000  ffff800010000000  ffff800010000000  00010000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .text         011613a8  ffff800010010000  ffff800010010000  00020000  2**16
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .got.plt      00000018  ffff8000111713a8  ffff8000111713a8  011813a8  2**3
                  CONTENTS, ALLOC, LOAD, DATA
  3 .rodata       007a26b2  ffff800011180000  ffff800011180000  01190000  2**12
                  CONTENTS, ALLOC, LOAD, DATA
  4 .pci_fixup    000025f0  ffff8000119226c0  ffff8000119226c0  019326c0  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 __ksymtab     000100bc  ffff800011924cb0  ffff800011924cb0  01934cb0  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 __ksymtab_gpl 000147cc  ffff800011934d6c  ffff800011934d6c  01944d6c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 __ksymtab_strings 0003b09b  ffff800011949538  ffff800011949538
01959538  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 __param       00003e58  ffff8000119845d8  ffff8000119845d8  019945d8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 __modver      00000678  ffff800011988430  ffff800011988430  01998430  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 10 __ex_table    00002270  ffff800011988aa8  ffff800011988aa8  01998aa8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 11 .notes        0000003c  ffff80001198ad18  ffff80001198ad18  0199ad18  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 12 .init.text    0007b1a8  ffff8000119a0000  ffff8000119a0000  019a0000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .exit.text    00004120  ffff800011a1b1a8  ffff800011a1b1a8  01a1b1a8  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE

The size of  .text is increased by 0x1000 bytes, or 0.02%
The size of .rodata is decreased by 0x300 bytes, or 0.06%
The size of  __ksymtab is decreased by 24 bytes, or 0.018%

So the cost of this fast path is ~3.3K.

Regarding code generation, this is the quite typical find_bit() user:

unsigned int cpumask_next(int n, const struct cpumask *srcp)
{
        /* -1 is a legal arg here. */
        if (n != -1)
                cpumask_check(n);
        return find_next_bit(cpumask_bits(srcp), nr_cpumask_bits, n + 1);
}
EXPORT_SYMBOL(cpumask_next);

Before:
0000000000000000 <cpumask_next>:
   0:   a9bf7bfd        stp     x29, x30, [sp, #-16]!
   4:   11000402        add     w2, w0, #0x1
   8:   aa0103e0        mov     x0, x1
   c:   d2800401        mov     x1, #0x40                       // #64
  10:   910003fd        mov     x29, sp
  14:   93407c42        sxtw    x2, w2
  18:   94000000        bl      0 <find_next_bit>
  1c:   a8c17bfd        ldp     x29, x30, [sp], #16
  20:   d65f03c0        ret
  24:   d503201f        nop

After:
0000000000000140 <cpumask_next>:
 140:   11000400        add     w0, w0, #0x1
 144:   93407c00        sxtw    x0, w0
 148:   f100fc1f        cmp     x0, #0x3f
 14c:   54000168        b.hi    178 <cpumask_next+0x38>  // b.pmore
 150:   f9400023        ldr     x3, [x1]
 154:   92800001        mov     x1, #0xffffffffffffffff         // #-1
 158:   9ac02020        lsl     x0, x1, x0
 15c:   52800802        mov     w2, #0x40                       // #64
 160:   8a030001        and     x1, x0, x3
 164:   dac00020        rbit    x0, x1
 168:   f100003f        cmp     x1, #0x0
 16c:   dac01000        clz     x0, x0
 170:   1a800040        csel    w0, w2, w0, eq  // eq = none
 174:   d65f03c0        ret
 178:   52800800        mov     w0, #0x40                       // #64
 17c:   d65f03c0        ret

find_next_bit() call is removed with the cost of 6 instructions.
(And I suspect we can improve the GENMASK() for better code generation.)
find_next_bit() itself is 41 instructions, so I believe this fast path
would bring
value for many users.

On the other hand, if someone is limited with I-cache and wants to avoid
generating fast paths, I think it's worth to make SMALL_CONST() configurable,
which would also disable fast paths in lib/bitmap.c. We may rely on -Os flag, or
enable fast path in config explicitly:

diff --git a/include/asm-generic/bitsperlong.h
b/include/asm-generic/bitsperlong.h
index 0eeb77544f1d..209e531074c1 100644
--- a/include/asm-generic/bitsperlong.h
+++ b/include/asm-generic/bitsperlong.h
@@ -23,6 +23,10 @@
 #define BITS_PER_LONG_LONG 64
 #endif

+#ifdef CONFIG_FAST_PATH
 #define SMALL_CONST(n) (__builtin_constant_p(n) && (unsigned long)(n)
< BITS_PER_LONG)
+#else
+#define SMALL_CONST(n) (0)
+#endif

 #endif /* __ASM_GENERIC_BITS_PER_LONG */
diff --git a/lib/Kconfig b/lib/Kconfig
index a38cc61256f1..c9629b3ebce8 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -39,6 +39,14 @@ config PACKING

          When in doubt, say N.

+config FAST_PATH
+       bool "Enable fast path code generation"
+       default y
+       help
+         This option enables fast path optimization with the cost
+         of increasing the text section.
+
 config BITREVERSE
        tristate

> With Best Regards,
> Andy Shevchenko

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ