linux-kernel - Re: [PATCH] arm64: enable GENERIC_FIND_FIRST

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20210224154416.GA1181413@yury-ThinkPad>
Date:   Wed, 24 Feb 2021 07:44:16 -0800
From:   Yury Norov <yury.norov@...il.com>
To:     Alexander Lobakin <alobakin@...me>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-mips@...r.kernel.org
Subject: Re: [PATCH] arm64: enable  GENERIC_FIND_FIRST_BIT

On Wed, Feb 24, 2021 at 11:52:55AM +0000, Alexander Lobakin wrote:
> From: Yury Norov <yury.norov@...il.com>
> Date: Sat, 5 Dec 2020 08:54:06 -0800
> 
> Hi,
> 
> > ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
> > enable it in config. It leads to using find_next_bit() which is less
> > efficient:
> >
> > 0000000000000000 <find_first_bit>:
> >    0:	aa0003e4 	mov	x4, x0
> >    4:	aa0103e0 	mov	x0, x1
> >    8:	b4000181 	cbz	x1, 38 <find_first_bit+0x38>
> >    c:	f9400083 	ldr	x3, [x4]
> >   10:	d2800802 	mov	x2, #0x40                  	// #64
> >   14:	91002084 	add	x4, x4, #0x8
> >   18:	b40000c3 	cbz	x3, 30 <find_first_bit+0x30>
> >   1c:	14000008 	b	3c <find_first_bit+0x3c>
> >   20:	f8408483 	ldr	x3, [x4], #8
> >   24:	91010045 	add	x5, x2, #0x40
> >   28:	b50000c3 	cbnz	x3, 40 <find_first_bit+0x40>
> >   2c:	aa0503e2 	mov	x2, x5
> >   30:	eb02001f 	cmp	x0, x2
> >   34:	54ffff68 	b.hi	20 <find_first_bit+0x20>  // b.pmore
> >   38:	d65f03c0 	ret
> >   3c:	d2800002 	mov	x2, #0x0                   	// #0
> >   40:	dac00063 	rbit	x3, x3
> >   44:	dac01063 	clz	x3, x3
> >   48:	8b020062 	add	x2, x3, x2
> >   4c:	eb02001f 	cmp	x0, x2
> >   50:	9a829000 	csel	x0, x0, x2, ls  // ls = plast
> >   54:	d65f03c0 	ret
> >
> >   ...
> >
> > 0000000000000118 <_find_next_bit.constprop.1>:
> >  118:	eb02007f 	cmp	x3, x2
> >  11c:	540002e2 	b.cs	178 <_find_next_bit.constprop.1+0x60>  // b.hs, b.nlast
> >  120:	d346fc66 	lsr	x6, x3, #6
> >  124:	f8667805 	ldr	x5, [x0, x6, lsl #3]
> >  128:	b4000061 	cbz	x1, 134 <_find_next_bit.constprop.1+0x1c>
> >  12c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
> >  130:	8a0600a5 	and	x5, x5, x6
> >  134:	ca0400a6 	eor	x6, x5, x4
> >  138:	92800005 	mov	x5, #0xffffffffffffffff    	// #-1
> >  13c:	9ac320a5 	lsl	x5, x5, x3
> >  140:	927ae463 	and	x3, x3, #0xffffffffffffffc0
> >  144:	ea0600a5 	ands	x5, x5, x6
> >  148:	54000120 	b.eq	16c <_find_next_bit.constprop.1+0x54>  // b.none
> >  14c:	1400000e 	b	184 <_find_next_bit.constprop.1+0x6c>
> >  150:	d346fc66 	lsr	x6, x3, #6
> >  154:	f8667805 	ldr	x5, [x0, x6, lsl #3]
> >  158:	b4000061 	cbz	x1, 164 <_find_next_bit.constprop.1+0x4c>
> >  15c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
> >  160:	8a0600a5 	and	x5, x5, x6
> >  164:	eb05009f 	cmp	x4, x5
> >  168:	540000c1 	b.ne	180 <_find_next_bit.constprop.1+0x68>  // b.any
> >  16c:	91010063 	add	x3, x3, #0x40
> >  170:	eb03005f 	cmp	x2, x3
> >  174:	54fffee8 	b.hi	150 <_find_next_bit.constprop.1+0x38>  // b.pmore
> >  178:	aa0203e0 	mov	x0, x2
> >  17c:	d65f03c0 	ret
> >  180:	ca050085 	eor	x5, x4, x5
> >  184:	dac000a5 	rbit	x5, x5
> >  188:	dac010a5 	clz	x5, x5
> >  18c:	8b0300a3 	add	x3, x5, x3
> >  190:	eb03005f 	cmp	x2, x3
> >  194:	9a839042 	csel	x2, x2, x3, ls  // ls = plast
> >  198:	aa0203e0 	mov	x0, x2
> >  19c:	d65f03c0 	ret
> >
> >  ...
> >
> > 0000000000000238 <find_next_bit>:
> >  238:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
> >  23c:	aa0203e3 	mov	x3, x2
> >  240:	d2800004 	mov	x4, #0x0                   	// #0
> >  244:	aa0103e2 	mov	x2, x1
> >  248:	910003fd 	mov	x29, sp
> >  24c:	d2800001 	mov	x1, #0x0                   	// #0
> >  250:	97ffffb2 	bl	118 <_find_next_bit.constprop.1>
> >  254:	a8c17bfd 	ldp	x29, x30, [sp], #16
> >  258:	d65f03c0 	ret
> >
> > Enabling this functions would also benefit for_each_{set,clear}_bit().
> > Would it make sense to enable this config for all such architectures by
> > default?
> 
> I confirm that GENERIC_FIND_FIRST_BIT also produces more optimized and
> fast code on MIPS (32 R2) where there is also no architecture-specific
> bitsearching routines.
> So, if it's okay for other folks, I'd suggest to go for it and enable
> for all similar arches.
 
As far as I understand the idea of GENERIC_FIND_FIRST_BIT=n, it's
intended to save some space in .text. But in fact it bloats the
kernel:

        yury:linux$ scripts/bloat-o-meter vmlinux vmlinux.ffb
        add/remove: 4/1 grow/shrink: 19/251 up/down: 564/-1692 (-1128)
        ...

For the next cycle, I'm going to submit a patch that removes the 
GENERIC_FIND_FIRST_BIT completely and forces all architectures to
use find_first{_zero}_bit() 

> (otherwise, I'll publish a separate entry for mips-next after 5.12-rc1
>  release and mention you in "Suggested-by:")

I think it worth to enable GENERIC_FIND_FIRST_BIT for mips and arm now
and see how it works for people. If there'll be no complains I'll remove
the config entirely. I'm OK if you submit the patch for mips now, or we
can make a series and submit together. Works either way.