Message-ID: <20250521135246.7dab6bda@pumpkin>
Date: Wed, 21 May 2025 13:52:46 +0100
From: David Laight <david.laight.linux@...il.com>
To: Nicolas Pitre <npitre@...libre.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel@...r.kernel.org,
u.kleine-koenig@...libre.com, Oleg Nesterov <oleg@...hat.com>, Peter
Zijlstra <peterz@...radead.org>, Biju Das <biju.das.jz@...renesas.com>
Subject: Re: [PATCH v2 next 3/4] lib: Add mul_u64_add_u64_div_u64() and
mul_u64_u64_div_u64_roundup()
On Tue, 20 May 2025 18:24:58 -0400 (EDT)
Nicolas Pitre <npitre@...libre.com> wrote:
> On Tue, 20 May 2025, David Laight wrote:
>
> > On Mon, 19 May 2025 23:03:21 -0400 (EDT)
> > Nicolas Pitre <npitre@...libre.com> wrote:
> >
...
> > > Here you should do:
> > >
> > > 	if (ilog2(a) + ilog2(b) <= 62) {
> > > 		u64 ab = a * b;
> > > 		u64 abc = ab + c;
> > > 		if (ab <= abc)
> > > 			return div64_u64(abc, d);
> > > 	}
> > >
> > > This is cheap and won't unconditionally discard the faster path when c != 0;
> >
> > That isn't really cheap.
> > ilog2() is likely to be a similar cost to a multiply
> > (my brain remembers them both as 'latency 3' on x86).
>
> I'm not discussing the ilog2() usage though. I'm just against limiting
> the test to !c. My suggestion is about supporting all values of c.
I've had further thoughts on that test.
Most (but not all - and I've forgotten which) 64bit cpus have a 64x64=>128
multiply and support u128.
So the 'multiply in parts' code is mostly for 32bit.
That means that the 'a * b' for the call to div64_u64() has to be three
32x32=>64 multiplies with all the extra 'add' and 'adc $0' to generate
a correct 64bit result.
This is (well should be) much the same as the multiply coded in the
function - except it is generated by the compiler itself.
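
For reference, a rough sketch of building the low 64 bits of 'a * b' from
32bit halves with three 32x32=>64 multiplies. The function name is made up
for illustration, and this is plain C rather than what any compiler
actually emits:

```c
#include <stdint.h>

/* Hypothetical name: low 64 bits of a * b using only 32x32=>64 multiplies. */
static uint64_t mul64_lo_by_parts(uint64_t a, uint64_t b)
{
	uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
	uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

	/* Multiply 1: low halves, contributes to bits 0..63. */
	uint64_t lo = (uint64_t)a_lo * b_lo;

	/*
	 * Multiplies 2 and 3: the cross terms. Their sum may wrap, but only
	 * its low 32 bits survive the shift below, so that does not matter.
	 * a_hi * b_hi only affects bits 64 and up and is not needed here.
	 */
	uint64_t cross = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo;

	return lo + (cross << 32);
}
```

The full 128bit product additionally needs a_hi * b_hi and the carries out
of the middle column, which is the part the 64bit-only result gets to skip.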
The only parts it can ignore are those that set 'z' and 'y_hi'.
If my clock sequence (in the other email) is correct it saves 3 of 10
clocks - so a test to avoid the multiply has to be better than that.
That probably means the only worthwhile check is for a and b being 32bit
so a single multiply can be used.
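
Something like the following is what such a check could look like. This is
only a sketch: the helper name and the 'return 0 to fall back' convention
are invented for illustration, and 'd' is assumed to be non-zero:

```c
#include <stdint.h>

/*
 * Hypothetical fast path: if both a and b fit in 32 bits, a single
 * 32x32=>64 multiply gives the exact product.
 * Returns 1 and stores (a * b + c) / d in *res on success,
 * returns 0 if the caller must use the full multiply-in-parts path.
 */
static int mul_add_div_fast(uint64_t a, uint64_t b, uint64_t c,
			    uint64_t d, uint64_t *res)
{
	uint64_t ab, abc;

	/* One test covers both operands: any bit set above bit 31? */
	if ((a | b) >> 32)
		return 0;

	/* Cannot overflow: (2^32 - 1)^2 < 2^64. */
	ab = (uint64_t)(uint32_t)a * (uint32_t)b;

	/* Adding c can still wrap, so that has to be checked. */
	abc = ab + c;
	if (abc < ab)
		return 0;

	*res = abc / d;
	return 1;
}
```

The test costs an 'or' and a compare of the high words on 32bit, and it
does not exclude non-zero 'c'.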
The generated code for 32bit x86 isn't as good as one might hope,
partially due to only having 7 (6 if %bp is used as a frame pointer) registers.
clang makes a reasonable job of it, gcc doesn't.
There is already a mul_u32_u32() wrapper in arch/x86/include/asm/div64.h.
There needs to be a similar add_u64_u32() (containing 'add %s,%d_lo; adc $0,%d_hi').
Without them gcc spills a lot of values to the stack - including constant zeros.
I might add those and use them in v3 (which I need to send anyway).
They'll match what my 'pending' faster code does.
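
In portable C the two helpers amount to no more than the sketch below. The
'_sketch' names are invented here, add_u64_u32() does not exist yet, and the
x86 versions would use inline asm to pin down the instruction sequence:

```c
#include <stdint.h>

/* 32x32=>64 multiply; the casts stop the compiler treating it as 64x64. */
static inline uint64_t mul_u32_u32_sketch(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b;
}

/*
 * Add a 32bit value to a 64bit one. On 32bit x86 the intent is an
 * 'add' to the low word followed by 'adc $0' to the high word,
 * without materialising a zero high half for 'b'.
 */
static inline uint64_t add_u64_u32_sketch(uint64_t a, uint32_t b)
{
	return a + b;
}
```

The point of the wrappers is only to control code generation, the
arithmetic is unchanged.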
David