[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260112132320.1a1ddbd7@pumpkin>
Date: Mon, 12 Jan 2026 13:23:20 +0000
From: David Laight <david.laight.linux@...il.com>
To: "Maciej W. Rozycki" <macro@...am.me.uk>
Cc: Yury Norov <ynorov@...dia.com>, Petr Tesarik <ptesarik@...e.com>, Yury
Norov <yury.norov@...il.com>, Rasmus Villemoes <linux@...musvillemoes.dk>,
Richard Henderson <richard.henderson@...aro.org>, Matt Turner
<mattst88@...il.com>, Magnus Lindholm <linmag7@...il.com>, Vineet Gupta
<vgupta@...nel.org>, Geert Uytterhoeven <geert@...ux-m68k.org>, Thomas
Bogendoerfer <tsbogend@...ha.franken.de>, Madhavan Srinivasan
<maddy@...ux.ibm.com>, Michael Ellerman <mpe@...erman.id.au>, Heiko
Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>, Alexander
Gordeev <agordeev@...ux.ibm.com>, Chris Zankel <chris@...kel.net>, Max
Filippov <jcmvbkbc@...il.com>, Patrik Jakobsson
<patrik.r.jakobsson@...il.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Robin Murphy <robin.murphy@....com>, Joerg
Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>, Jakub Kicinski
<kuba@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Paolo Abeni
<pabeni@...hat.com>, Oliver Neukum <oliver@...kum.org>, Arnd Bergmann
<arnd@...db.de>, Kuan-Wei Chiu <visitorckw@...il.com>, Andrew Morton
<akpm@...ux-foundation.org>, Marcel Holtmann <marcel@...tmann.org>, Johan
Hedberg <johan.hedberg@...il.com>, Luiz Augusto von Dentz
<luiz.dentz@...il.com>, Pablo Neira Ayuso <pablo@...filter.org>, Florian
Westphal <fw@...len.de>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] treewide, bits: use ffs_val() where it is
open-coded
On Mon, 12 Jan 2026 11:21:59 +0000 (GMT)
"Maciej W. Rozycki" <macro@...am.me.uk> wrote:
> On Sun, 11 Jan 2026, David Laight wrote:
>
> > > If you have a population count instruction such as CTPOP on EV67 Alpha,
> > > you can do it in two instructions, such as:
> > >
> > > ctpop $16, $0
> > > cmpeq $0, 1, $0
> > >
> > > so there seems to be room for improvement here, but GCC stubbornly refuses
> > > to produce this instruction sequence for this code:
> > >
> > > bool is_power_of_2(unsigned long n)
> > > {
> > > return __builtin_popcountl(n) == 1;
> > > }
> > >
> > > apparently owing to no insn cost set for the POPCOUNT RTX in the backend
> > > causing the middle end to pick up some default. Instead GCC comes up with
> > > what can be coded in C as:
> > >
> > > bool is_power_of_2(unsigned long n)
> > > {
> > > return n - 1 < (n ^ (n - 1));
> > > }
> >
> > Not seen that one - and not thought of popcnt, far too new for me.
> > (I learnt asm for a pdp11 first...)
>
> It does show up as an RTL operation in the compiler; it's a good way to
> discover stuff.
>
> > > which seems an interesting alternative that produces 3 instructions only
> > > across Alpha, MIPS and RISC-V targets. It's 5 instructions on POWER and
> > > x86-64 vs 8 and 9 respectively with our current implementation. There are
> > > no branches produced here, although an inline expansion will likely end
> > > with one.
> >
> > I'm seeing:
> > is_power_of_2:
> > leaq -1(%rdi), %rax
> > xorq %rax, %rdi
> > cmpq %rdi, %rax
> > setb %al
> > retq
> > on x86-64 - which is really 3 instructions.
>
> Ack, this is `unsigned long' vs `bool' return type. I'm just too used to
> RISC psABIs. FWIW I have this:
>
> leaq -1(%rdi), %rax
> xorq %rax, %rdi
> cmpq %rdi, %rax
> setb %al
> movzbl %al, %eax
> ret
>
> in the former case and I take it `bool' is 8-bit on x86.
That was _Bool on godbolt, it you return 'int' the extra 'zero extend'
is needed. SETcc was added for 386 (I still forget about it) and someone
assumed you wouldn't want the 'word' variant.
(It isn't as though they couldn't have allocated another row of the 0F
opcode to it.)
I sort of hate 'bool' types, something sane has to happen for 'a & b'
when the bit patterns are unexpected - and it ought to be well defined.
The traditional C where 'a > b' generates 0 or 1 is fine.
> > While using the popcnt instruction would reduce it by one, it will only
> > be faster on amd zen cpu, on intel cpu popcnt has a latency of 3 and only
> > executes on p1 (according to Agner).
>
> Ack. I've lost track of x86 after ~Pentium Pro.
Those were the days :-)
I do remember something from one of the Intel manuals of that period about
not intermixing data with code because of the possibility of the cpu
speculatively executing the data following a branch and if the data happened
to be a floating point trig function it had to wait for it to finish.
SLS isn't a modern thing.
> > > > I suspect there may not be one, but all these 'tricks' come from the 1970s
> > > > (or earlier) and don't allow for the behaviour of modern cpu with multiple
> > > > execution units and long pipelines.
> > >
> > > As we can see you never know. I've sent a code update proposal.
> >
> > I'd suggest moving is_power_of_2() to bits.h as:
> > #define is_power_of_2(n) ({ auto _n = n; _n - 1 < _n ^ (_n - 1); })
> > and add:
> > #define is_zero_or_power_of_2(n) ({ auto _n = n; _n & (_n - 1) != 0; })
>
> I'll leave it as an exercise for someone else.
I might add it to the patchset I'm writing - mostly bits.h
> > > The use of the population count operation can be considered separately,
> > > but we'd have to be careful here as the quality of code produced by the
> > > intrinsic varies, e.g. with pre-EV67 Alpha, RISC-V and VAX (!) GCC emits
> > > code equivalent to my proposal, but with MIPS or x86-64 it resorts to a
> > > libcall, so a guarding condition would be required. So I'd leave it for
> > > another occasion.
> >
> > I'd guess that many of those popcnt instructions aren't actually as fast
> > as the dec/xor pair.
>
> With EV67 Alpha and POWER being proper RISC architectures I'd expect the
> instruction to be implemented using a dedicated logic gate network. After
> all the population count operation is just an n-operand addition of 1-bit
> inputs, so not a big deal. The propagation delay is probably on the same
> order of magnitude as with a 2-operand n-bit adder, so the operation ought
> to fit in a single pipeline stage.
There are some tricks that can be done to reduce the propagation delay
of a 64bit add from ~64 (for the ripple carry) to nearer 16 (add each byte
with and without a carry input and then use a mux to select the correct
output). That may not work for popcnt.
Also I suspect that multiply isn't single clock, so multicycle instructions
have to be supported.
Throw silicon at multiply and you can reduce it to full-adders with a carry-
chain the length of the result, not sure you can do much better.
For my sins I re-implemented the Nios-II cpu a couple of years ago (because
Intel had deprecated it), the single 64bit add needed at the end of multiply
(done by fgpa DSP blocks) made it impossible to feed the product straight
back into the ALU for the next cycle.
My version is about the same size, 1 clock faster in a few places (like
predicted taken branches and the stall if a multiply result (and I think
memory reads) is needed in the next-but-one instruction.
FMax is a bit lower, but we were constrained by the PCIe interface to 67.5MHz
and it was fast enough.
David
>
> I suppose it's just a matter of whether you want to dedicate a piece of
> silicon just for this somewhat exotic operation, and in the old days the
> answer would be no for a general-purpose CPU owing to manufacturing cost
> and die size limitation, while nowadays with process shrinkage and power
> consumption reduction it may have become worthwhile.
>
> > But the libcall is just plain stupid.
>
> It could just be that it's a recent addition as my compilation of GCC for
> x86-64 and MIPS turns out to be version 11 and 13 respectively, while I
> have version 15 for the remaining targets. Indeed when I tried now GCC 14
> that I have at hand for another MIPS configuration the libcall is gone, so
> the optimisation must have been added in that version.
>
> Of course you still need a libcall for a plain `__builtin_popcountl' call
> where there's no suitable hardware instruction available.
>
> Maciej
>
Powered by blists - more mailing lists