lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.21.2601120056570.30566@angie.orcam.me.uk>
Date: Mon, 12 Jan 2026 11:21:59 +0000 (GMT)
From: "Maciej W. Rozycki" <macro@...am.me.uk>
To: David Laight <david.laight.linux@...il.com>
cc: Yury Norov <ynorov@...dia.com>, Petr Tesarik <ptesarik@...e.com>, 
    Yury Norov <yury.norov@...il.com>, 
    Rasmus Villemoes <linux@...musvillemoes.dk>, 
    Richard Henderson <richard.henderson@...aro.org>, 
    Matt Turner <mattst88@...il.com>, Magnus Lindholm <linmag7@...il.com>, 
    Vineet Gupta <vgupta@...nel.org>, 
    Geert Uytterhoeven <geert@...ux-m68k.org>, 
    Thomas Bogendoerfer <tsbogend@...ha.franken.de>, 
    Madhavan Srinivasan <maddy@...ux.ibm.com>, 
    Michael Ellerman <mpe@...erman.id.au>, Heiko Carstens <hca@...ux.ibm.com>, 
    Vasily Gorbik <gor@...ux.ibm.com>, 
    Alexander Gordeev <agordeev@...ux.ibm.com>, 
    Chris Zankel <chris@...kel.net>, Max Filippov <jcmvbkbc@...il.com>, 
    Patrik Jakobsson <patrik.r.jakobsson@...il.com>, 
    Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
    Maxime Ripard <mripard@...nel.org>, 
    Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>, 
    Simona Vetter <simona@...ll.ch>, Robin Murphy <robin.murphy@....com>, 
    Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>, 
    Jakub Kicinski <kuba@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>, 
    "David S. Miller" <davem@...emloft.net>, 
    Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, 
    Oliver Neukum <oliver@...kum.org>, Arnd Bergmann <arnd@...db.de>, 
    Kuan-Wei Chiu <visitorckw@...il.com>, 
    Andrew Morton <akpm@...ux-foundation.org>, 
    Marcel Holtmann <marcel@...tmann.org>, 
    Johan Hedberg <johan.hedberg@...il.com>, 
    Luiz Augusto von Dentz <luiz.dentz@...il.com>, 
    Pablo Neira Ayuso <pablo@...filter.org>, Florian Westphal <fw@...len.de>, 
    linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] treewide, bits: use ffs_val() where it is
 open-coded

On Sun, 11 Jan 2026, David Laight wrote:

> >  If you have a population count instruction such as CTPOP on EV67 Alpha, 
> > you can do it in two instructions, such as:
> > 
> > 	ctpop	$16, $0
> > 	cmpeq	$0, 1, $0
> > 
> > so there seems to be room for improvement here, but GCC stubbornly refuses 
> > to produce this instruction sequence for this code:
> > 
> > bool is_power_of_2(unsigned long n)
> > {
> > 	return __builtin_popcountl(n) == 1;
> > }
> > 
> > apparently owing to no insn cost set for the POPCOUNT RTX in the backend 
> > causing the middle end to pick up some default.  Instead GCC comes up with 
> > what can be coded in C as:
> > 
> > bool is_power_of_2(unsigned long n)
> > {
> > 	return n - 1 < (n ^ (n - 1));
> > }
> 
> Not seen that one - and not thought of popcnt, far too new for me.
> (I learnt asm for a pdp11 first...)

 It does show up as an RTL operation in the compiler; it's a good way to 
discover stuff.

> > which seems an interesting alternative that produces 3 instructions only 
> > across Alpha, MIPS and RISC-V targets.  It's 5 instructions on POWER and 
> > x86-64 vs 8 and 9 respectively with our current implementation.  There are 
> > no branches produced here, although an inline expansion will likely end 
> > with one.
> 
> I'm seeing:
> is_power_of_2:
>         leaq    -1(%rdi), %rax
>         xorq    %rax, %rdi
>         cmpq    %rdi, %rax
>         setb    %al
>         retq
> on x86-64 - which is really 3 instructions.

 Ack, this is `unsigned long' vs `bool' return type.  I'm just too used to 
RISC psABIs.  FWIW I have this:

	leaq	-1(%rdi), %rax
	xorq	%rax, %rdi
	cmpq	%rdi, %rax
	setb	%al
	movzbl	%al, %eax
	ret

in the former case and I take it `bool' is 8-bit on x86.

> While using the popcnt instruction would reduce it by one, it will only
> be faster on amd zen cpu, on intel cpu popcnt has a latency of 3 and only
> executes on p1 (according to Agner).

 Ack.  I've lost track of x86 after ~Pentium Pro.

> > > I suspect there may not be one, but all these 'tricks' come from the 1970s
> > > (or earlier) and don't allow for the behaviour of modern cpu with multiple
> > > execution units and long pipelines.  
> > 
> >  As we can see you never know.  I've sent a code update proposal.
> 
> I'd suggest moving is_power_of_2() to bits.h as:
> #define is_power_of_2(n) ({ auto _n = n; _n - 1 < _n ^ (_n - 1); })
> and add:
> #define is_zero_or_power_of_2(n) ({ auto _n = n; _n & (_n - 1) != 0; })

 I'll leave it as an exercise for someone else.

> >  The use of the population count operation can be considered separately, 
> > but we'd have to be careful here as the quality of code produced by the 
> > intrinsic varies, e.g. with pre-EV67 Alpha, RISC-V and VAX (!) GCC emits 
> > code equivalent to my proposal, but with MIPS or x86-64 it resorts to a 
> > libcall, so a guarding condition would be required.  So I'd leave it for 
> > another occasion.
> 
> I'd guess that many of those popcnt instructions aren't actually as fast
> as the dec/xor pair.

 With EV67 Alpha and POWER being proper RISC architectures I'd expect the 
instruction to be implemented using a dedicated logic gate network.  After 
all the population count operation is just an n-operand addition of 1-bit 
inputs, so not a big deal.  The propagation delay is probably on the same 
order of magnitude as with a 2-operand n-bit adder, so the operation ought 
to fit in a single pipeline stage.

 I suppose it's just a matter of whether you want to dedicate a piece of
silicon just for this somewhat exotic operation, and in the old days the 
answer would be no for a general-purpose CPU owing to manufacturing cost 
and die size limitation, while nowadays with process shrinkage and power 
consumption reduction it may have become worthwhile.

> But the libcall is just plain stupid.

 It could just be that it's a recent addition as my compilation of GCC for 
x86-64 and MIPS turns out to be version 11 and 13 respectively, while I 
have version 15 for the remaining targets.  Indeed when I tried now GCC 14 
that I have at hand for another MIPS configuration the libcall is gone, so 
the optimisation must have been added in that version.

 Of course you still need a libcall for a plain `__builtin_popcountl' call 
where there's no suitable hardware instruction available.

  Maciej

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ