Message-ID: <CAGudoHFRbAf=3xWTe_asYLb38D1qr59nCTYmJcGcdetiUgyHLA@mail.gmail.com>
Date: Thu, 3 Apr 2025 01:15:48 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: David Laight <david.laight.linux@...il.com>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, x86@...nel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for
 inlined ops

On Thu, Apr 3, 2025 at 12:29 AM David Laight
<david.laight.linux@...il.com> wrote:
>
> On Wed, 2 Apr 2025 15:42:40 +0200
> Mateusz Guzik <mjguzik@...il.com> wrote:
>
> > Not a real submission yet as I would like results from other people.
> >
> > tl;dr when benchmarking compilation of a hello-world program I'm getting
> > a 1.7% increase in throughput on Sapphire Rapids when convincing the
> > compiler to only use regular stores for inlined memset and memcpy
> >
> > Note this uarch does have FSRM and still benefits from not using it for
> > some cases.
> >
> > I am not in position to bench this on other CPUs, would be nice if
> > someone did it on AMD.
>
> I did some benchmarking of 'rep movsb' on a zen 5.
> Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.
> For large copies you get 64 bytes/clock.
> Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks.
> But it then jumps to 38 clocks.
> And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
> In that case short copies jump to 55 clocks.
> Otherwise alignment doesn't make much difference.
>

I think this roughly follows the standard advice on how to do
benchmarks, but at the same time I think it has too much potential to
distort the differences between these routines.
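
For reference, if I read the test right, it is roughly the pattern
below (untested sketch in C; that counter 0 is programmed to count
core cycles, and that rdpmc is usable from wherever this runs, are
assumptions on my part):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* mfence; rdpmc; mfence around each read, per the description above */
static inline uint64_t read_cycles_pmc(uint32_t idx)
{
	uint32_t lo, hi;

	asm volatile("mfence\n\trdpmc\n\tmfence"
		     : "=a" (lo), "=d" (hi)
		     : "c" (idx));
	return ((uint64_t)hi << 32) | lo;
}

static uint64_t time_copy(void *dst, const void *src, size_t len)
{
	uint64_t start, end;

	start = read_cycles_pmc(0);	/* assumed counter index */
	memcpy(dst, src, len);		/* the code under test */
	end = read_cycles_pmc(0);
	return end - start;
}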

The fences force the CPU to discard the state accumulated beforehand
and probably prevent it from speculatively working on the instructions
which follow. But what if uarchs tolerate a plain mov loop better than
rep mov in that regard? (up to a point, of course)
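
To be clear, the two variants I have in mind are roughly the following
(sketch; the rep side is what FSRM is supposed to make cheap, the loop
side approximates the code gcc emits when told to stay away from
string ops):

#include <stdint.h>
#include <stddef.h>

/* rep movsb: dst in rdi, src in rsi, count in rcx */
static void copy_rep(void *dst, const void *src, size_t len)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     :
		     : "memory");
}

/* plain 8-byte stores; len assumed to be a multiple of 8 for brevity */
static void copy_loop(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	const uint64_t *s = src;
	size_t i;

	for (i = 0; i < len / 8; i++)
		d[i] = s[i];
}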

Based on my tests running the compiler, Sapphire Rapids does prefer
the loop approach at least up to 256 bytes, despite Fast Short REP
MOV. This can be seen in sync_regs() for example.
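
FWIW the gcc-side knobs for this are -mmemcpy-strategy and
-mmemset-strategy, each taking <alg>:<max_size>:<dest_align> triplets
tried in order. Something along these lines (the exact algorithms and
cutoffs below are illustrative, not necessarily what the final patch
will use):

# e.g. in the relevant Makefile: plain loads/stores up to 256 bytes,
# fall back to the out-of-line routines past that
KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign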

If you are up for it, I would appreciate it if you ran the actual
bench as described in my opening mail. It is not hard to set up, but
it does require rebuilding the kernel. Perhaps you can do it in a VM;
it is not a scalability bench.

-- 
Mateusz Guzik <mjguzik gmail.com>
