[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.21.2601111450500.30566@angie.orcam.me.uk>
Date: Sun, 11 Jan 2026 21:22:03 +0000 (GMT)
From: "Maciej W. Rozycki" <macro@...am.me.uk>
To: David Laight <david.laight.linux@...il.com>
cc: Yury Norov <ynorov@...dia.com>, Petr Tesarik <ptesarik@...e.com>,
Yury Norov <yury.norov@...il.com>,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Richard Henderson <richard.henderson@...aro.org>,
Matt Turner <mattst88@...il.com>, Magnus Lindholm <linmag7@...il.com>,
Vineet Gupta <vgupta@...nel.org>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Michael Ellerman <mpe@...erman.id.au>, Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Chris Zankel <chris@...kel.net>, Max Filippov <jcmvbkbc@...il.com>,
Patrik Jakobsson <patrik.r.jakobsson@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Robin Murphy <robin.murphy@....com>,
Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
Oliver Neukum <oliver@...kum.org>, Arnd Bergmann <arnd@...db.de>,
Kuan-Wei Chiu <visitorckw@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Marcel Holtmann <marcel@...tmann.org>,
Johan Hedberg <johan.hedberg@...il.com>,
Luiz Augusto von Dentz <luiz.dentz@...il.com>,
Pablo Neira Ayuso <pablo@...filter.org>, Florian Westphal <fw@...len.de>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] treewide, bits: use ffs_val() where it is
open-coded
On Sun, 11 Jan 2026, David Laight wrote:
> > It's 4 MIPS instructions and exactly the same machine code produced as
> > compiler output from `is_power_of_2'.
>
> The compiler must be detecting that x == (x & -x) is the same as
> (x & (x - 1)) == 0.
There are 3 basic operations either way, so unless a CPU has an odd
instruction that combines any, there's no way to reduce the instruction
count.
> Although for MIPS the former is negate, and, beq(x,y) - so 3 instructions.
> On x86 it is negate, and, cmp, beq(zero flag) - one extra instruction.
> (The 4th MIPS instruction will be beq(syn,0).)
This code only ever runs on R2000/R3000 and R4000/R4400 processors, on
which the cost of an untaken branch is nil. The cost of a taken branch on
the former ones is nil too, with their 4-stage pipelines, and while there
is a fixed penalty on the latter ones, the compiler should be able to
arrange code here such that at most one branch is taken anyway.
> So maybe s/Badly/In an unusual way/
What's unusual in this way of determining that exactly one bit is set in
a word? It's as good as any I suppose.
> > If you know of a better alternative
>
> Doing is_power_of_2() with only one conditional branch would be a gain.
> But I've not thought about it hard enough to find one.
Sure you can, either of:
(x != 0) & (x == (x & -x))
and:
(x != 0) & ((x & (x - 1)) == 0)
will do, but it's 5 instructions on MIPS, so no gain here.
If you have a population count instruction such as CTPOP on EV67 Alpha,
you can do it in two instructions, such as:
ctpop $16, $0
cmpeq $0, 1, $0
so there seems to be room for improvement here, but GCC stubbornly refuses
to produce this instruction sequence for this code:
bool is_power_of_2(unsigned long n)
{
return __builtin_popcountl(n) == 1;
}
apparently owing to no insn cost set for the POPCOUNT RTX in the backend
causing the middle end to pick up some default. Instead GCC comes up with
what can be coded in C as:
bool is_power_of_2(unsigned long n)
{
return n - 1 < (n ^ (n - 1));
}
which seems an interesting alternative that produces 3 instructions only
across Alpha, MIPS and RISC-V targets. It's 5 instructions on POWER and
x86-64 vs 8 and 9 respectively with our current implementation. There are
no branches produced here, although an inline expansion will likely end
with one.
Overall it seems a worthwhile improvement even if still an instruction
longer than necessary for EV67 Alpha (and also for POWER, which also has
a population count operation, and which GCC does pick up in this case).
FWIW on VAX, notorious for the lack of conditional set operations, it's 7
instructions and 1 branch, down from 14 and 2 respectively.
> I suspect there may not be one, but all these 'tricks' come from the 1970s
> (or earlier) and don't allow for the behaviour of modern cpu with multiple
> execution units and long pipelines.
As we can see you never know. I've sent a code update proposal.
The use of the population count operation can be considered separately,
but we'd have to be careful here as the quality of code produced by the
intrinsic varies, e.g. with pre-EV67 Alpha, RISC-V and VAX (!) GCC emits
code equivalent to my proposal, but with MIPS or x86-64 it resorts to a
libcall, so a guarding condition would be required. So I'd leave it for
another occasion.
Thanks for sparking this discussion overall!
Maciej
Powered by blists - more mailing lists