Message-ID: <403s8q39-33sp-pp3s-95o8-14s190or25o5@onlyvoer.pbz>
Date: Wed, 21 May 2025 09:50:28 -0400 (EDT)
From: Nicolas Pitre <npitre@...libre.com>
To: David Laight <david.laight.linux@...il.com>
cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and
mul_u64_u64_div_u64_roundup()
On Wed, 21 May 2025, David Laight wrote:
> On Tue, 20 May 2025 18:24:58 -0400 (EDT)
> Nicolas Pitre <npitre@...libre.com> wrote:
>
> > On Tue, 20 May 2025, David Laight wrote:
> >
> > > On Mon, 19 May 2025 23:03:21 -0400 (EDT)
> > > Nicolas Pitre <npitre@...libre.com> wrote:
> > >
> ...
> > > > Here you should do:
> > > >
> > > > 	if (ilog2(a) + ilog2(b) <= 62) {
> > > > 		u64 ab = a * b;
> > > > 		u64 abc = ab + c;
> > > > 		if (ab <= abc)
> > > > 			return div64_u64(abc, d);
> > > > 	}
> > > >
> > > > This is cheap and won't unconditionally discard the faster path when c != 0;
> > >
> > > That isn't really cheap.
> > > ilog2() is likely to be a similar cost to a multiply
> > > (my brain remembers them both as 'latency 3' on x86).
> >
> > I'm not discussing the ilog2() usage though. I'm just against limiting
> > the test to !c. My suggestion is about supporting all values of c.
>
> I've had further thoughts on that test.
> Most (but not all - and I've forgotten which) 64bit cpus have a 64x64=>128
> multiply and support u128.
Looks like X86-64, ARM64 and RV64 have it. That's probably 99% of the market.
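For illustration, a minimal sketch of how that 64x64=>128 multiply is usually reached from C. It assumes a GCC or Clang 64-bit target that provides unsigned __int128; the helper name mul_u64_u64_hi() is hypothetical and not part of the patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): return the high 64 bits of
 * a * b and store the low 64 bits in *lo.
 * Assumes unsigned __int128 is available (GCC/Clang, 64-bit targets).
 */
static uint64_t mul_u64_u64_hi(uint64_t a, uint64_t b, uint64_t *lo)
{
	unsigned __int128 p = (unsigned __int128)a * b;

	*lo = (uint64_t)p;
	return (uint64_t)(p >> 64);
}
```

On targets with a native widening multiply the compiler can be expected to emit a single multiply instruction (or a mul/mulh pair) for this.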
> So the 'multiply in parts' code is mostly for 32bit.
Exactly.
> That means that the 'a * b' for the call to div64_u64() has to be three
> 32x32=>64 multiplies with all the extra 'add' and 'adc $0' to generate
> a correct 64bit result.
4 multiplies to be precise.
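To make the count concrete, here is a minimal sketch of a full 64x64=>128 multiply built from four 32x32=>64 partial products. The struct and function names are hypothetical and this is not the code from the patch. Note that the low 64 bits alone do not depend on the a_hi * b_hi term:

```c
#include <stdint.h>

/* Hypothetical result type for illustration: a 128-bit value in two halves. */
struct u128_parts {
	uint64_t hi;
	uint64_t lo;
};

/* Full 64x64=>128 multiply using only 32x32=>64 multiplies. */
static struct u128_parts mul_64x64_128(uint64_t a, uint64_t b)
{
	uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
	uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

	/* The four partial products. */
	uint64_t ll = (uint64_t)a_lo * b_lo;
	uint64_t lh = (uint64_t)a_lo * b_hi;
	uint64_t hl = (uint64_t)a_hi * b_lo;
	uint64_t hh = (uint64_t)a_hi * b_hi;

	/*
	 * Middle column (bits 32..63 plus carry). Three values below 2^32
	 * are summed, so this cannot overflow 64 bits.
	 */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	struct u128_parts r;

	r.lo = (mid << 32) | (uint32_t)ll;
	/* hh is only needed here, for the upper 64 bits. */
	r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
	return r;
}
```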
> This is (well should be) much the same as the multiply coded in the
> function - except it is generated by the compiler itself.
I don't follow you here. What is the same as what?
> The only parts it can ignore are those that set 'z' and 'y_hi'.
> If my clock sequence (in the other email) is correct it saves 3 of 10
> clocks - so the test to avoid the multiply has to be better than that.
> That probably means the only worthwhile check is for a and b being 32bit
> so a single multiply can be used.
Depends how costly the ilog2 is. On ARM the clz instruction is about 1
cycle. If you need to figure out the MSB manually then it might be best
to skip those ilog2's.
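As a userspace sketch of the check being discussed: the fast path applies when a * b provably fits in 64 bits and adding c does not wrap. The names ilog2_u64() and try_mul_add_div_fast() are hypothetical, __builtin_clzll() stands in for the kernel's ilog2() (so GCC or Clang is assumed), and the explicit zero handling is an assumption added here because the builtin is undefined for 0:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ilog2(); assumes a GCC/Clang builtin and v != 0. */
static inline int ilog2_u64(uint64_t v)
{
	return 63 - __builtin_clzll(v);
}

/*
 * Hypothetical helper: try to compute (a * b + c) / d using only 64-bit
 * arithmetic. Returns true and stores the quotient in *res when no
 * intermediate value overflows; returns false when the caller has to
 * fall back to the wide multiply/divide path. The caller guarantees d != 0.
 */
static bool try_mul_add_div_fast(uint64_t a, uint64_t b, uint64_t c,
				 uint64_t d, uint64_t *res)
{
	/* Assumption: handle zero operands up front, the product is 0. */
	if (a == 0 || b == 0) {
		*res = c / d;
		return true;
	}

	/* a < 2^(la+1) and b < 2^(lb+1), so la + lb <= 62 means a * b < 2^64. */
	if (ilog2_u64(a) + ilog2_u64(b) <= 62) {
		uint64_t ab = a * b;
		uint64_t abc = ab + c;

		/* ab <= abc holds exactly when the addition did not wrap. */
		if (ab <= abc) {
			*res = abc / d;
			return true;
		}
	}
	return false;
}
```

The test is conservative: some products that do fit in 64 bits (for example 3 * 2^62 is too large, but 2^32 * (2^32 - 1) is not) are still sent to the slow path.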
> The generated code for 32bit x86 isn't as good as one might hope,
> partially due to only having 7 (6 if %bp is a stack frame) registers.
> clang makes a reasonable job of it, gcc doesn't.
> There is already a mul_u32_u32() wrapper in arch/x86/include/asm/div64.h.
> There needs to be a similar add_u64_u32() (contains add %s,%d_lo, adc $0,%d_hi).
> Without them gcc spills a lot of values to stack - including constant zeros.
I mainly looked at ARM32 and both gcc and clang do a good job here. ARM
registers are plentiful of course.
> I might add those and use them in v3 (which I need to send anyway).
> They'll match what my 'pending' faster code does.
>
> David
>
>