Message-ID: <20250722225653.65520fa3@pumpkin>
Date: Tue, 22 Jul 2025 22:56:53 +0100
From: David Laight <david.laight.linux@...il.com>
To: Oleg Nesterov <oleg@...hat.com>
Cc: "H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>, Peter
Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
"Li,Rongqing" <lirongqing@...du.com>, Steven Rostedt <rostedt@...dmis.org>,
linux-kernel@...r.kernel.org, x86@...nel.org
Subject: Re: [PATCH] x86/math64: handle #DE in mul_u64_u64_div_u64()
On Tue, 22 Jul 2025 20:38:54 +0200
Oleg Nesterov <oleg@...hat.com> wrote:
> On 07/22, H. Peter Anvin wrote:
> >
> > On July 22, 2025 10:58:08 AM PDT, Oleg Nesterov <oleg@...hat.com> wrote:
> > >On 07/22, H. Peter Anvin wrote:
> > >>
> > >> On July 22, 2025 3:50:35 AM PDT, Oleg Nesterov <oleg@...hat.com> wrote:
> > >> >
> > >> >The generic implementation doesn't WARN... OK, I won't argue.
> > >> >How about
> > >> >
> > >> > static inline u64 mul_u64_u64_div_u64(u64 a, u64 mul, u64 div)
> > >> > {
> > >> > char ok = 0;
> > >> > u64 q;
> > >> >
> > >> > asm ("mulq %3; 1: divq %4; movb $1,%1; 2:\n"
> > >> > _ASM_EXTABLE(1b, 2b)
> > >> > : "=a" (q), "+r" (ok)
> > >> > : "a" (a), "rm" (mul), "rm" (div)
> > >> > : "rdx");
> > >> >
> > >> > if (ok)
> > >> > return q;
> > >> > BUG_ON(!div);
> > >> > WARN_ON_ONCE(1);
> > >> > return ~(u64)0;
> > >> > }
> > >> >
> > >> >?
> > >> >
> > >> >Oleg.
> > >>
> > >> Maybe the generic version *should* warn?
> > >
> > >David is going to change the generic version to WARN.
> > >
> > >> As far as the ok output, the Right Way™ to do it is with an asm goto instead
> > >> of a status variable; the second best tends to be to use the flags output.
> > >
> > >This is what I was going to do initially. But this needs
> > >CONFIG_CC_HAS_ASM_GOTO_OUTPUT
> > >
> > >Oleg.
> > >
> >
> > But that's what you want to optimize for, since that is all the modern compilers, even if you have to have two versions as a result.
>
> Well, this 'divq' is slow anyway, I don't want to add 2 versions.
> Can we add the optimized version later if it really makes sense?
Yes, what matters more is code size and simplicity of use (by the caller).
Zen3 has a reasonably fast divq, but you have to get to Cannon Lake to
get an Intel one that isn't close to 100 clocks.
The generic code is horrid - nearly 1000 clocks for random data running
the 32bit code on Sandy Bridge! (I'm not sure newer CPUs will be much better.)
(And non-x86 without 'sh[rl]d' will be worse.)
I've re-written it (patches posted a few weeks back).
Sandy Bridge is now (from memory) ~250 clocks in 32bit and ~150 in 64bit,
Zen5 ~80 (helped by the much faster divq used for 64/64 divide).
The killer is mispredicted branches: change the arguments so a
slightly different path is taken and it costs at least 20 clocks.
I'm timing single calls, any kind of loop trains the branch predictor
(I'm not running cold cache though).
Given those clock counts I really wouldn't worry about a few integer
instructions in the x86_64 path.
David
>
> Oleg.
>
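
For readers following the thread, the behaviour being discussed (trap on
a zero divisor, return ~0 when the quotient does not fit in 64 bits) can
be sketched in user space. This is only an illustration, not the kernel
code: the function name is hypothetical, it assumes a compiler with
unsigned __int128 (GCC or Clang on a 64-bit target), and assert() merely
stands in for BUG_ON(). It says nothing about the clock counts quoted
above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for mul_u64_u64_div_u64().
 * Computes (a * mul) / div with a 128-bit intermediate product.
 */
static uint64_t mul_u64_u64_div_u64_sketch(uint64_t a, uint64_t mul,
					   uint64_t div)
{
	unsigned __int128 prod, q;

	/* Plays the role of BUG_ON(!div) in the proposal quoted above. */
	assert(div != 0);

	prod = (unsigned __int128)a * mul;
	q = prod / div;

	/*
	 * A quotient wider than 64 bits is the case where divq raises #DE.
	 * The quoted proposal warns once and returns all ones; only the
	 * return value is modelled here.
	 */
	if (q > UINT64_MAX)
		return ~(uint64_t)0;

	return (uint64_t)q;
}
```

A caller sees the same interface whichever way the x86 asm reports the
fault (status variable, flags output or asm goto), which is the
"simplicity of use" point made in the reply.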