[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260111235745.53d16a59@pumpkin>
Date: Sun, 11 Jan 2026 23:57:45 +0000
From: David Laight <david.laight.linux@...il.com>
To: "Maciej W. Rozycki" <macro@...am.me.uk>
Cc: Yury Norov <ynorov@...dia.com>, Petr Tesarik <ptesarik@...e.com>, Yury
Norov <yury.norov@...il.com>, Rasmus Villemoes <linux@...musvillemoes.dk>,
Richard Henderson <richard.henderson@...aro.org>, Matt Turner
<mattst88@...il.com>, Magnus Lindholm <linmag7@...il.com>, Vineet Gupta
<vgupta@...nel.org>, Geert Uytterhoeven <geert@...ux-m68k.org>, Thomas
Bogendoerfer <tsbogend@...ha.franken.de>, Madhavan Srinivasan
<maddy@...ux.ibm.com>, Michael Ellerman <mpe@...erman.id.au>, Heiko
Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>, Alexander
Gordeev <agordeev@...ux.ibm.com>, Chris Zankel <chris@...kel.net>, Max
Filippov <jcmvbkbc@...il.com>, Patrik Jakobsson
<patrik.r.jakobsson@...il.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Robin Murphy <robin.murphy@....com>, Joerg
Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>, Jakub Kicinski
<kuba@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Paolo Abeni
<pabeni@...hat.com>, Oliver Neukum <oliver@...kum.org>, Arnd Bergmann
<arnd@...db.de>, Kuan-Wei Chiu <visitorckw@...il.com>, Andrew Morton
<akpm@...ux-foundation.org>, Marcel Holtmann <marcel@...tmann.org>, Johan
Hedberg <johan.hedberg@...il.com>, Luiz Augusto von Dentz
<luiz.dentz@...il.com>, Pablo Neira Ayuso <pablo@...filter.org>, Florian
Westphal <fw@...len.de>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] treewide, bits: use ffs_val() where it is
open-coded
On Sun, 11 Jan 2026 21:22:03 +0000 (GMT)
"Maciej W. Rozycki" <macro@...am.me.uk> wrote:
> On Sun, 11 Jan 2026, David Laight wrote:
>
> > > It's 4 MIPS instructions and exactly the same machine code produced as
> > > compiler output from `is_power_of_2'.
> >
> > The compiler must be detecting that x == (x & -x) is the same as
> > (x & (x - 1)) == 0.
>
> There are 3 basic operations either way, so unless a CPU has an odd
> instruction that combines any, there's no way to reduce the instruction
> count.
>
> > Although for MIPS the former is negate, and, beq(x,y) - so 3 instructions.
> > On x86 it is negate, and, cmp, beq(zero flag) - one extra instruction.
> > (The 4th MIPS instruction will be beq(syn,0).)
>
> This code only ever runs on R2000/R3000 and R4000/R4400 processors, on
> which the cost of an untaken branch is nil. The cost of a taken branch on
> the former ones is nil too, with their 4-stage pipelines, and while there
> is a fixed penalty on the latter ones, the compiler should be able to
> arrange code here such that at most one branch is taken anyway.
>
> > So maybe s/Badly/In an unusual way/
>
> What's unusual in this way of determining that exactly one bit is set in
> a word? It's as good as any I suppose.
>
> > > If you know of a better alternative
> >
> > Doing is_power_of_2() with only one conditional branch would be a gain.
> > But I've not thought about it hard enough to find one.
>
> Sure you can, either of:
>
> (x != 0) & (x == (x & -x))
>
> and:
>
> (x != 0) & ((x & (x - 1)) == 0)
>
> will do, but it's 5 instructions on MIPS, so no gain here.
>
> If you have a population count instruction such as CTPOP on EV67 Alpha,
> you can do it in two instructions, such as:
>
> ctpop $16, $0
> cmpeq $0, 1, $0
>
> so there seems to be room for improvement here, but GCC stubbornly refuses
> to produce this instruction sequence for this code:
>
> bool is_power_of_2(unsigned long n)
> {
> return __builtin_popcountl(n) == 1;
> }
>
> apparently owing to no insn cost set for the POPCOUNT RTX in the backend
> causing the middle end to pick up some default. Instead GCC comes up with
> what can be coded in C as:
>
> bool is_power_of_2(unsigned long n)
> {
> return n - 1 < (n ^ (n - 1));
> }
Not seen that one - and not thought of popcnt, far too new for me.
(I learnt asm for a pdp11 first...)
> which seems an interesting alternative that produces 3 instructions only
> across Alpha, MIPS and RISC-V targets. It's 5 instructions on POWER and
> x86-64 vs 8 and 9 respectively with our current implementation. There are
> no branches produced here, although an inline expansion will likely end
> with one.
I'm seeing:
is_power_of_2:
leaq -1(%rdi), %rax
xorq %rax, %rdi
cmpq %rdi, %rax
setb %al
retq
on x86-64 - which is really 3 instructions.
While using the popcnt instruction would reduce it by one, it will only
be faster on amd zen cpu, on intel cpu popcnt has a latency of 3 and only
executes on p1 (according to Agner).
>
> Overall it seems a worthwhile improvement even if still an instruction
> longer than necessary for EV67 Alpha (and also for POWER, which also has
> a population count operation, and which GCC does pick up in this case).
>
> FWIW on VAX, notorious for the lack of conditional set operations, it's 7
> instructions and 1 branch, down from 14 and 2 respectively.
I worked for ICL through the 1980s so missed out on VAX.
Picked up 8085, x86, sparc32 and m68k.
> > I suspect there may not be one, but all these 'tricks' come from the 1970s
> > (or earlier) and don't allow for the behaviour of modern cpu with multiple
> > execution units and long pipelines.
>
> As we can see you never know. I've sent a code update proposal.
I'd suggest moving is_power_of_2() to bits.h as:
#define is_power_of_2(n) ({ auto _n = n; _n - 1 < _n ^ (_n - 1); })
and add:
#define is_zero_or_power_of_2(n) ({ auto _n = n; _n & (_n - 1) != 0; })
>
> The use of the population count operation can be considered separately,
> but we'd have to be careful here as the quality of code produced by the
> intrinsic varies, e.g. with pre-EV67 Alpha, RISC-V and VAX (!) GCC emits
> code equivalent to my proposal, but with MIPS or x86-64 it resorts to a
> libcall, so a guarding condition would be required. So I'd leave it for
> another occasion.
I'd guess that many of those popcnt instructions aren't actually as fast
as the dec/xor pair.
But the libcall is just plain stupid.
I think that happens elsewhere - which is one reason __ffs() is x86 asm.
>
> Thanks for sparking this discussion overall!
Nice wet Sunday bikeshed...
David
>
> Maciej
>
Powered by blists - more mailing lists