Message-ID: <20250402232917.6978ffa3@pumpkin>
Date: Wed, 2 Apr 2025 23:29:17 +0100
From: David Laight <david.laight.linux@...il.com>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for
inlined ops
On Wed, 2 Apr 2025 15:42:40 +0200
Mateusz Guzik <mjguzik@...il.com> wrote:
> Not a real submission yet as I would like results from other people.
>
> tl;dr when benchmarking compilation of a hello-world program I'm getting
> a 1.7% increase in throughput on Sapphire Rapids when convincing the
> compiler to only use regular stores for inlined memset and memcpy
>
> Note this uarch does have FSRM and still benefits from not using it for
> some cases.
>
> I am not in position to bench this on other CPUs, would be nice if
> someone did it on AMD.
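(For anyone wanting to poke at the compiler side without the patch: gcc can
be steered away from the string ops on the command line, e.g. something like
the below - I haven't checked whether this matches what the patch actually does:

	/* quick way to see what gcc inlines for a fixed-size memset */
	void fill(char *p) { __builtin_memset(p, 0, 256); }

	$ gcc -O2 -S fill.c                                     # may use rep stosq
	$ gcc -O2 -mstringop-strategy=unrolled_loop -S fill.c    # plain stores
)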
I did some benchmarking of 'rep movsb' on a Zen 5.
Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.
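Roughly this shape (not my exact harness, just a sketch - the counter number
and enabling userspace rdpmc are left as an exercise):

	#include <stdint.h>
	#include <stddef.h>

	static inline uint64_t rdpmc_cycles(unsigned int counter)
	{
		uint32_t lo, hi;

		/* mfence either side so the copy can't leak past the read */
		asm volatile("mfence; rdpmc; mfence"
			     : "=a" (lo), "=d" (hi)
			     : "c" (counter));
		return ((uint64_t)hi << 32) | lo;
	}

	static uint64_t time_rep_movsb(void *dst, const void *src, size_t len,
				       unsigned int counter)
	{
		uint64_t start, end;

		start = rdpmc_cycles(counter);
		asm volatile("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (len)
			     : : "memory");
		end = rdpmc_cycles(counter);

		/* still includes the empty-test overhead mentioned below */
		return end - start;
	}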
For large copies you get 64 bytes/clock.
Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks -
but the cost then jumps to 38 clocks.
And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
In that case short copies jump to 55 clocks.
Otherwise alignment doesn't make much difference.
If those values are right, you want to use 'rep movsb' for short copies,
but probably not for ones between 128 and 256 bytes!
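(If the numbers hold up elsewhere, that suggests a size cut-off roughly like
the sketch below - rep_movsb() and copy_with_stores() are just placeholder
names, not anything that exists:

	if (len < 128 || len > 256)
		rep_movsb(dst, src, len);	 /* cheap short copies, 64 bytes/clock long ones */
	else
		copy_with_stores(dst, src, len); /* dodge the 38 clock step */
)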
I might need to run with an inner loop.
The overhead for an empty test (an asm block with "nop" instead of "rep movsb")
is 180 clocks (which has been subtracted from the clock counts above).
But I've used the same scheme for 'normal' instructions (testing ipcsum)
and got sane results.
David