lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150212234603.30908.qmail@ns.horizon.com>
Date:	12 Feb 2015 18:46:03 -0500
From:	"George Spelvin" <linux@...izon.com>
To:	linux@...izon.com, linux@...musvillemoes.dk
Cc:	akpm@...ux-foundation.org, chris@...is-wilson.co.uk,
	davem@...emloft.net, dborkman@...hat.com,
	hannes@...essinduktion.org, klimov.linux@...il.com,
	laijs@...fujitsu.com, linux-kernel@...r.kernel.org,
	msalter@...hat.com, takahiro.akashi@...aro.org, tgraf@...g.ch,
	valentinrothberg@...il.com, yury.norov@...il.com
Subject: Re: [PATCH v3 1/3] lib: find_*_bit reimplementation

> That, and if the compiler was smart enough, the AND should actually be
> free (at least on x86), since it can replace a test instruction. [1]

Ah, right, x86 loads don't set the flags, so you need a TEST
instruction anyway.  So the AND is free!

Of course, not all the world's an x86.  But load/store architectures
generally need a test instruction as well.  It's things like MIPS and
Aloha, that don't have condition codes, but only conditional branches
on register values, that will need an extra cycle.

> In any case, my code compiles to fewer bytes (though not an entire cache
> line), and I think it is somewhat clearer - I'm obviously biased on the
> latter point.

Myself, I think the code that shows that only the first word gets masked
by control flow is clearer, but since the complexity is completely
localized to this function, I'll take smaller and faster.

> [1] Unfortunately, the compiler doesn't seem to be smart enough :-( When I
> compile my code with gcc 5.0, I get
> 
> 0000000000000000 <find_last_bit_rv>:
>    0:   53                      push   %rbx
>    1:   89 f1                   mov    %esi,%ecx
>    3:   48 8d 5e 3f             lea    0x3f(%rsi),%rbx
>    7:   f7 d9                   neg    %ecx
>    9:   48 c7 c2 ff ff ff ff    mov    $0xffffffffffffffff,%rdx
>   10:   48 83 e3 c0             and    $0xffffffffffffffc0,%rbx
>   14:   48 d3 ea                shr    %cl,%rdx
>   17:   eb 1a                   jmp    33 <find_last_bit_rv+0x33>
>   19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
> 
>   20:   48 89 d1                mov    %rdx,%rcx
>   23:   48 23 0c df             and    (%rdi,%rbx,8),%rcx
>   27:   48 c7 c2 ff ff ff ff    mov    $0xffffffffffffffff,%rdx
>   2e:   48 85 c9                test   %rcx,%rcx
>   31:   75 15                   jne    48 <find_last_bit_rv+0x48>
>   33:   48 83 eb 01             sub    $0x1,%rbx
>   37:   48 83 fb ff             cmp    $0xffffffffffffffff,%rbx
>   3b:   75 e3                   jne    20 <find_last_bit_rv+0x20>
> 
>   3d:   48 89 f0                mov    %rsi,%rax
>   40:   5b                      pop    %rbx
>   41:   c3                      retq   
>   42:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>   48:   48 89 cf                mov    %rcx,%rdi
>   4b:   48 c1 e3 06             shl    $0x6,%rbx
>   4f:   e8 00 00 00 00          callq  54 <find_last_bit_rv+0x54>
>   54:   48 98                   cltq   
>   56:   48 01 d8                add    %rbx,%rax
>   59:   5b                      pop    %rbx
>   5a:   c3                      retq   
>   5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

> the main loop is 20--3b. The test instruction at 2e seems to be
> redundant. The same at 37: the sub instruction already sets plenty of
> flags that could be used, so explicitly comparing %rbx to -1 seems
> redundant.

Er... I think you hand-edited that code; it's wrong.  The loop assumes that
%rbx is in units of words, but the prologue sets it up in units of bits.

The mov to %rcx is also redundant, since it could be eliminated with some
minor rescheduling.

The code generation I *want* for that function is:

# addr in %rdi, size in %rsi
	movl	%esi, %ecx
	leaq	0x3f(%rsi), %rax
	negl	%ecx
	movq	$-1, %rdx
        shrq	$6, %rax
	shrq	%cl, %rdx
	jmp	2f
1:
	movq	$-1, %rdx
2:
	subq	$1, %rax
	jc	3f
	andq	(%rdi,%rax,8), %rdx
	jeq	1b

	bsrq	%rdx, %rdx
        salq    $6, %rax
	addq	%rdx, %rax
        ret
3:
	movq	%rsi, %rax
	retq

I wonder if the compiler can be convinced by a bit of source tweaking?

unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
{
	size_t idx = DIV_ROUND_UP(size, BITS_PER_LONG);
	unsigned long val = LAST_WORD_MASK(size);

	while (idx--) {
		val &= addr[idx];
		if (val)
			return idx * BITS_PER_LONG + __fls(val);
		val = ~0ul;
	}
	return size;
}

Darn, no difference...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ