Message-ID: <CAGudoHFRbAf=3xWTe_asYLb38D1qr59nCTYmJcGcdetiUgyHLA@mail.gmail.com>
Date: Thu, 3 Apr 2025 01:15:48 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: David Laight <david.laight.linux@...il.com>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for
inlined ops
On Thu, Apr 3, 2025 at 12:29 AM David Laight
<david.laight.linux@...il.com> wrote:
>
> On Wed, 2 Apr 2025 15:42:40 +0200
> Mateusz Guzik <mjguzik@...il.com> wrote:
>
> > Not a real submission yet as I would like results from other people.
> >
> > tl;dr when benchmarking compilation of a hello-world program I'm getting
> > a 1.7% increase in throughput on Sapphire Rapids when convincing the
> > compiler to only use regular stores for inlined memset and memcpy
> >
> > Note this uarch does have FSRM and still benefits from not using it for
> > some cases.
> >
> > I am not in a position to bench this on other CPUs; it would be nice if
> > someone did it on AMD.
>
> I did some benchmarking of 'rep movsb' on a zen 5.
> Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.
> For large copies you get 64 bytes/clock.
> Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks,
> but past that it jumps to 38 clocks.
> And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
> In that case short copies jump to 55 clocks.
> Otherwise alignment doesn't make much difference.
>
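For reference, here is roughly how I read your harness (a userspace
sketch; I'm using rdtsc instead of rdpmc for simplicity, and the buffer
setup and sizes are made up):

#include <stddef.h>
#include <stdint.h>

static inline uint64_t fenced_tsc(void)
{
        uint32_t lo, hi;

        /* mfence on both sides, per your description */
        asm volatile("mfence; rdtsc; mfence" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

/* time a single rep movsb of len bytes */
static uint64_t time_rep_movsb(void *dst, const void *src, size_t len)
{
        uint64_t t0, t1;

        t0 = fenced_tsc();
        asm volatile("rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (len)
                     :: "memory");
        t1 = fenced_tsc();
        return t1 - t0;
}
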
I think this roughly follows the standard advice on how to microbenchmark,
but at the same time I think it has too much potential to distort the
differences between these routines.
The fences force the CPU to retire all previously accumulated state and
probably prevent it from speculatively working on the instructions that
follow. But what if some uarchs tolerate a plain mov loop better than rep
mov in the surrounding code? (up to a point, of course)
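To make that concrete, the competing variant I have in mind is a plain
mov loop timed the same way (again just a sketch, reusing fenced_tsc from
above; a real test would also have to keep gcc from replacing the loop
with a memcpy call):

/* time a simple qword copy loop; len assumed to be a multiple of 8 */
static uint64_t time_mov_loop(void *dst, const void *src, size_t len)
{
        uint64_t *d = dst;
        const uint64_t *s = src;
        uint64_t t0, t1;
        size_t i;

        t0 = fenced_tsc();
        for (i = 0; i < len / 8; i++)
                d[i] = s[i];
        t1 = fenced_tsc();
        return t1 - t0;
}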
Based on my tests running the compiler, Sapphire Rapids prefers the loop
approach at least up to 256 bytes, despite Fast Short REP MOV. This can be
seen in sync_regs(), for example.
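For completeness, gcc exposes knobs for steering this; the general idea is
something along the lines of the following, where the strategy choice and
the 256-byte cutoff are placeholders to be settled by benchmarking (see
the actual patch for the real thing):

        -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
        -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign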
If you are up for it, I would appreciate it if you ran the actual bench as
described in my opening mail. It is not hard to set up, but it does
require rebuilding the kernel. You could do it in a vm, since it's not a
scalability bench.
--
Mateusz Guzik <mjguzik gmail.com>