Message-ID: <20250402232917.6978ffa3@pumpkin>
Date: Wed, 2 Apr 2025 23:29:17 +0100
From: David Laight <david.laight.linux@...il.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, x86@...nel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for
 inlined ops

On Wed, 2 Apr 2025 15:42:40 +0200
Mateusz Guzik <mjguzik@...il.com> wrote:

> Not a real submission yet as I would like results from other people.
> 
> tl;dr when benchmarking compilation of a hello-world program I'm getting
> a 1.7% increase in throughput on Sapphire Rapids when convincing the
> compiler to only use regular stores for inlined memset and memcpy
> 
> Note this uarch does have FSRM and still benefits from not using it for
> some cases.
> 
> I am not in position to bench this on other CPUs, would be nice if
> someone did it on AMD.

I did some benchmarking of 'rep movsb' on a Zen 5.
Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.
For large copies you get 64 bytes/clock.
Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks.
But at 128 bytes the cost jumps to 38 clocks.
And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
In that case short copies jump to 55 clocks.
Otherwise alignment doesn't make much difference.
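
For reference, the offset condition above can be written as a small
predicate. This is only a sketch: the function name is made up for
illustration, and the 1..63 window is just the range reported above for
this one Zen 5 test, not a documented property of the CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (name is illustrative only).
 * Returns true when (dest - src) % 4096 is between 1 and 63, the case
 * reported above as slow for short 'rep movsb' copies.
 * Unsigned subtraction wraps modulo a power of two, so masking with 4095
 * gives the non-negative remainder even when dest < src.
 */
bool movsb_slow_offset(uintptr_t dest, uintptr_t src)
{
	uintptr_t delta = (dest - src) & 4095;

	return delta >= 1 && delta <= 63;
}
```

With addresses cast to uintptr_t, a dest that sits 1 byte past src (or one
page plus 1 byte past it) hits the window, while a 64 byte offset does not.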

If those values are right you want to use 'rep movsb' for short copies,
but probably not for ones between 128 and 256 bytes!

I might need to run with an inner loop.
The overhead for an empty test (an asm block with "nop" instead of "rep movsb")
is 180 clocks, which has been subtracted from the clock counts above.
But I've used the same scheme for 'normal' instructions (testing ipcsum)
and got sane results.
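
A rough user-space sketch of that scheme, for anyone wanting to repeat it.
Assumptions, none of which come from the actual test code: rdtsc stands in
for rdpmc (rdpmc needs the kernel to allow user-space counter reads, and
rdtsc counts reference cycles rather than core clocks), all function names
are invented, and non-x86 builds fall back to clock_gettime() and memcpy()
just so it compiles. There is no inner loop, so a single run is noisy.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__x86_64__)
/* ASSUMPTION: rdtsc replaces the rdpmc used in the original test. */
static inline uint64_t read_counter(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("mfence; rdtsc; mfence"
			     : "=a"(lo), "=d"(hi) : : "memory");
	return ((uint64_t)hi << 32) | lo;
}

/* Code under test: 'rep movsb' copying len bytes from src to dst. */
static void test_copy(void *dst, const void *src, size_t len)
{
	__asm__ __volatile__("rep movsb"
			     : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
}

/* Empty test: a "nop" in place of the copy, to measure the overhead. */
static void test_empty(void *dst, const void *src, size_t len)
{
	(void)dst; (void)src; (void)len;
	__asm__ __volatile__("nop" : : : "memory");
}
#else
/* Portable fallback (nanoseconds, not clocks) so the sketch still builds. */
static inline uint64_t read_counter(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

static void test_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

static void test_empty(void *dst, const void *src, size_t len)
{
	(void)dst; (void)src; (void)len;
}
#endif

typedef void (*test_fn)(void *dst, const void *src, size_t len);

/* counter; test_code; counter */
static uint64_t time_once(test_fn fn, void *dst, const void *src, size_t len)
{
	uint64_t start = read_counter();

	fn(dst, src, len);
	return read_counter() - start;
}

/*
 * Cost of one copy with the empty-test overhead subtracted.
 * The result can come out negative when the two runs are noisy.
 */
int64_t measure_copy(void *dst, const void *src, size_t len)
{
	uint64_t overhead = time_once(test_empty, dst, src, len);
	uint64_t total = time_once(test_copy, dst, src, len);

	return (int64_t)(total - overhead);
}
```

Both runs go through the same function pointer call, so that cost cancels
in the subtraction along with the fences and the counter reads.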

	David
