Message-ID: <CAFULd4ZzdPcnQAELpukF4vzUnQufteNqV4BzZr3BxuzRG+XK+A@mail.gmail.com>
Date: Thu, 6 Mar 2025 11:45:38 +0100
From: Uros Bizjak <ubizjak@...il.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Linus Torvalds <torvalds@...uxfoundation.org>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...el.com>, x86@...nel.org, linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>, Dave Hansen <dave.hansen@...ux.intel.com>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic
locking insns
On Wed, Mar 5, 2025 at 9:14 PM David Laight
<david.laight.linux@...il.com> wrote:
>
> On Wed, 5 Mar 2025 07:04:08 -1000
> Linus Torvalds <torvalds@...uxfoundation.org> wrote:
>
> > On Tue, 4 Mar 2025 at 22:54, Uros Bizjak <ubizjak@...il.com> wrote:
> > >
> > > Even to my surprise, the patch has some noticeable effects on the
> > > performance, please see the attachment in [1] for LMBench data or [2]
> > > for some excerpts from the data. So, I think the patch has potential
> > > to improve the performance.
> >
> > I suspect some of the performance difference - which looks
> > unexpectedly large - is due to having run them on a CPU with the
> > horrendous indirect return costs, and then inlining can make a huge
> > difference.
> ...
>
> Another possibility is that the processes are getting bounced around
> CPUs in a slightly different way.
> An idle cpu might be running at 800MHz, run something that spins on it
> and the clock speed will soon jump to 4GHz.
> But if your 'spinning' process is migrated to a different cpu it starts
> again at 800MHz.
>
> (I had something where an FPGA compile went from 12 mins to over 20 because
> the kernel RSB stuffing caused the scheduler to behave differently, even
> though nothing was doing a lot of system calls.)
>
> All sorts of things can affect that - possibly even making some code faster!
>
> The (IIRC) 30k increase in code size will be a few functions being inlined.
> The bloat-o-meter might show which, and forcing a few inlines the same way
> should reduce that difference.
bloat-o-meter is an excellent idea; I'll analyse the binaries some more
and report my findings.
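
Something along these lines (just a sketch, assuming the vmlinux
binaries from the before/after builds are kept around) should show
which functions grew or got newly inlined:

  ./scripts/bloat-o-meter vmlinux.before vmlinux.after | head -n 20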
> OTOH I'm surprised that one (or two) instructions make that much
> difference - unless gcc is managing to discard the size of the entire
> function rather than just the asm block itself.
Actually, the compiler uses the estimated function code size as one of
the criteria when deciding whether to fully (or partially - hot/cold
split) inline a function. The estimated code size of functions that use
the (patched) locking primitives is now lower, so they fall below the
inlining threshold, and the compiler inlines more of them. The compiler
knows the performance/size tradeoff of setting up a function call and
the perf/size tradeoff of creating the function frame in the called
function, and decides accordingly. Please note that inlining is
multi-level, so it doesn't stop at the first function.
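
To illustrate (a minimal sketch, not the actual kernel macros - the real
ones go through alternative patching; this only shows the asm vs. asm
inline difference the patch relies on):

  /* sketch only: a lock-prefixed increment, written two ways */
  static inline void atomic_inc_sketch(int *v)
  {
  #ifdef USE_ASM_INLINE
          /* "asm inline" (roughly what the kernel's asm_inline expands
           * to when the compiler supports it) tells gcc to assume the
           * minimum possible size for the asm, so callers are estimated
           * smaller and become more likely to be inlined. */
          asm inline volatile("lock incl %0" : "+m" (*v));
  #else
          /* plain asm: gcc guesses the size from the template text,
           * which can overestimate it badly once macros add lines. */
          asm volatile("lock incl %0" : "+m" (*v));
  #endif
  }

USE_ASM_INLINE and atomic_inc_sketch are made-up names for the example;
the point is only the asm inline qualifier and its effect on the size
estimate.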
Thanks,
Uros.