Message-ID: <20250609200427.7384908a@pumpkin>
Date: Mon, 9 Jun 2025 20:04:27 +0100
From: David Laight <david.laight.linux@...il.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Mateusz Guzik
 <mjguzik@...il.com>, torvalds@...ux-foundation.org, mingo@...hat.com,
 x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for
 inlined ops

On Mon, 9 Jun 2025 08:04:34 +0200
Uros Bizjak <ubizjak@...il.com> wrote:

> On Sun, Jun 8, 2025 at 10:51 PM David Laight
> <david.laight.linux@...il.com> wrote:
..
> > Do you ever want it?
> > From what I remember of benchmarking 'rep movsb' even on Ivy bridge the
> > alignment makes almost no difference to throughput.  
> 
> Please note that the instruction is "rep movsQ", it moves 64bit
> quantities. The alignment is needed to align data to the 64-bit
> boundary.

No it isn't, there is no requirement to align the data for 'rep movsq'.
Even a naive cpu will do misaligned transfers quite happily.
The worst that ought to happen is each memory access being split in two.
Since it is likely that copies will be aligned (or so short it doesn't
matter) the alignment code is just a waste of time.

Even the length checks used to pick the algorithm have a cost, and can kill
overall performance if the copies are often short.
You really do need the software to give a compile-time hint of the likely
length and use that to select the algorithm.
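
As a rough sketch of what such a hint could look like (the names
copy_hinted(), copy_short(), copy_long() and enum copy_hint are made up
for illustration, they are not existing kernel helpers): the caller passes
a constant hint, so after inlining the compiler can drop the branch and no
run-time length check is needed to choose the path.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hint values; not an existing kernel API. */
enum copy_hint { COPY_HINT_SHORT, COPY_HINT_LONG };

/* Path for copies expected to be short: a plain byte loop, no setup cost. */
static inline void copy_short(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len--)
		*d++ = *s++;
}

/* Path for copies expected to be long: stands in for a 'rep movs' based copy. */
static inline void copy_long(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/*
 * 'hint' is meant to be a compile-time constant at every call site.
 * Once this is inlined the comparison folds away and only one path remains.
 */
static inline void copy_hinted(void *dst, const void *src, size_t len,
			       enum copy_hint hint)
{
	if (hint == COPY_HINT_SHORT)
		copy_short(dst, src, len);
	else
		copy_long(dst, src, len);
}
```

A call site that knows it mostly copies small headers would use
copy_hinted(dst, src, len, COPY_HINT_SHORT), whatever len turns out to be.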

I need to check Sandy Bridge (I've got one with a recent Debian installed)
but even on Ivy Bridge 'rep movsq' is pretty much identical to 'rep movsb'
with the count multiplied by 8.
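
For reference, a minimal user-space sketch of the two forms being compared
(the wrapper names are invented here, GCC/Clang inline asm on x86-64 is
assumed, and other targets fall back to memcpy()). Neither wrapper aligns
anything; 'rep movsq' simply takes a count of 8-byte words where
'rep movsb' takes a count of bytes.

```c
#include <stddef.h>
#include <string.h>

/* Copy 'bytes' bytes with 'rep movsb' (assumes the direction flag is clear). */
static inline void copy_rep_movsb(void *dst, const void *src, size_t bytes)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ __volatile__("rep movsb"
			     : "+D" (dst), "+S" (src), "+c" (bytes)
			     :
			     : "memory");
#else
	memcpy(dst, src, bytes);	/* portable fallback */
#endif
}

/* Copy 'qwords' 8-byte words with 'rep movsq'; no alignment is required. */
static inline void copy_rep_movsq(void *dst, const void *src, size_t qwords)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ __volatile__("rep movsq"
			     : "+D" (dst), "+S" (src), "+c" (qwords)
			     :
			     : "memory");
#else
	memcpy(dst, src, qwords * 8);	/* portable fallback */
#endif
}
```

Calling copy_rep_movsq(dst, src, n) and copy_rep_movsb(dst, src, n * 8) on
the same (possibly misaligned) buffers gives the same result, which makes
them easy to time against each other.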

The fixed/setup costs do vary by CPU, but the per-byte costs for moderate
(a few KB - fitting in the D-cache) copies were the same for all the Intel
CPUs I had to hand at the time.
The only thing that mattered was cache-line aligning %rdi - doubled throughput.
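
A sketch of aligning the destination only, assuming a 64-byte cache line
(CACHE_LINE, bytes_to_cache_line() and copy_dst_aligned() are illustrative
names, and the second memcpy() stands in for the bulk 'rep movs'). The
source is left wherever it happens to be.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u	/* assumption: 64-byte cache lines */

/* Number of bytes needed to bring 'dst' up to the next cache-line boundary. */
static inline size_t bytes_to_cache_line(const void *dst)
{
	return (size_t)(-(uintptr_t)dst & (CACHE_LINE - 1));
}

/* Copy a short head so that the bulk copy starts on a cache-line boundary. */
static inline void copy_dst_aligned(void *dst, const void *src, size_t len)
{
	size_t head = bytes_to_cache_line(dst);

	if (head > len)
		head = len;

	memcpy(dst, src, head);
	/* Destination is now cache-line aligned (or the copy is finished). */
	memcpy((unsigned char *)dst + head,
	       (const unsigned char *)src + head, len - head);
}
```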

	David

