Message-ID: <20250609200427.7384908a@pumpkin>
Date: Mon, 9 Jun 2025 20:04:27 +0100
From: David Laight <david.laight.linux@...il.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Mateusz Guzik
<mjguzik@...il.com>, torvalds@...ux-foundation.org, mingo@...hat.com,
x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for
inlined ops
On Mon, 9 Jun 2025 08:04:34 +0200
Uros Bizjak <ubizjak@...il.com> wrote:
> On Sun, Jun 8, 2025 at 10:51 PM David Laight
> <david.laight.linux@...il.com> wrote:
..
> > Do you ever want it?
> > From what I remember of benchmarking 'rep movsb', even on Ivy Bridge
> > the alignment makes almost no difference to throughput.
>
> Please note that the instruction is "rep movsQ"; it moves 64-bit
> quantities. The alignment is needed to align the data to the 64-bit
> boundary.
No, it isn't: there is no requirement to align the data for 'rep movsq'.
Even a naive CPU will do misaligned transfers quite happily.
The worst that ought to happen is each memory access being split in two.
Since copies are likely to be aligned anyway (or so short it doesn't
matter), the alignment code is just a waste of time.
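To make that concrete, a minimal sketch (mine, for illustration - not
the kernel's actual copy routine) of a copy with no alignment prologue
at all, which is still architecturally correct:

static inline void copy_movsq(void *dst, const void *src,
			      unsigned long qwords)
{
	/* Copy 'qwords' 8-byte units.  No alignment of src or dst is
	 * required; the CPU handles misaligned accesses itself. */
	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");
}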
Even the length checks used to pick an algorithm have a cost, and they
can kill overall performance if the copies are often short.
You really do need the software to give a compile-time hint of the
likely length and use that to select the algorithm.
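Something along these lines, say (copy_hinted, big_copy and the
64-byte cut-off are invented here for illustration, not lifted from
the kernel sources):

#include <stddef.h>
#include <string.h>

/* Hypothetical out-of-line path for large or unknown lengths,
 * e.g. built on 'rep movsq'. */
void *big_copy(void *dst, const void *src, size_t len);

/* If the compiler can prove the length is a small constant, let it
 * inline plain mov pairs; otherwise take the rep-string path and pay
 * its setup cost once. */
#define copy_hinted(dst, src, len)				\
	(__builtin_constant_p(len) && (len) <= 64		\
		? memcpy((dst), (src), (len))			\
		: big_copy((dst), (src), (len)))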
I need to check Sandy Bridge (I've got one with a recent Debian
installed), but even on Ivy Bridge 'rep movsq' is pretty much identical
to 'rep movsb' with the count multiplied by 8.
The fixed/setup costs do vary by CPU, but the per-byte costs for
moderate copies (a few kB - small enough to fit in the D-cache) were
the same for all the Intel CPUs I had to hand at the time.
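Roughly the sort of measurement I mean - a one-shot rdtsc sketch with
no warm-up or serialising lfence, so treat the numbers as indicative
only:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

static uint64_t time_movsq(void *dst, const void *src, uint64_t qwords)
{
	uint64_t t0 = __rdtsc();
	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");
	return __rdtsc() - t0;
}

static uint64_t time_movsb(void *dst, const void *src, uint64_t bytes)
{
	uint64_t t0 = __rdtsc();
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (bytes)
		     : : "memory");
	return __rdtsc() - t0;
}

int main(void)
{
	static unsigned char src[4096], dst[4096];

	/* Same 4k payload both ways: the movsq count is bytes / 8. */
	printf("movsq: %llu cycles\n",
	       (unsigned long long)time_movsq(dst, src, sizeof src / 8));
	printf("movsb: %llu cycles\n",
	       (unsigned long long)time_movsb(dst, src, sizeof src));
	return 0;
}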
The only thing that mattered was cache-line aligning %rdi - that alone
doubled throughput.
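i.e. something like this sketch (the 64-byte line size is an
assumption, and the code is illustrative rather than anything I
actually shipped):

#include <stddef.h>
#include <stdint.h>

static void copy_dst_aligned(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	/* Bytes needed to bring the destination up to the next
	 * 64-byte cache-line boundary (0 if already aligned). */
	size_t head = (0 - (uintptr_t)d) & 63;

	if (head > len)
		head = len;
	len -= head;
	while (head--)			/* align %rdi first */
		*d++ = *s++;
	asm volatile("rep movsb"	/* bulk copy, dst line-aligned */
		     : "+D" (d), "+S" (s), "+c" (len)
		     : : "memory");
}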
David