Date:   Mon, 11 Sep 2023 10:37:58 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Mateusz Guzik' <mjguzik@...il.com>
CC:     "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

From: Mateusz Guzik
> Sent: 10 September 2023 11:54
> 
> On 9/3/23, David Laight <David.Laight@...lab.com> wrote:
> > ...
> >> When I was playing with this stuff about 5 years ago I found 32-byte
> >> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> >> Haswell and so on), but only up to a point where rep wins.
> >
> > Does the 'rep movsq' ever actually win?
> > (Unless you find one of the ERMS (or similar) versions.)
> > IIRC it only ever does one iteration per clock - and you
> > should be able to match that with a carefully constructed loop.
> >
> 
> Sorry for late reply, I missed your e-mail due to all the unrelated
> traffic in the thread and using gmail client. ;)
> 
> I am somewhat confused by the question though. In this very patch I'm
> showing numbers from an ERMS-less uarch getting a win from switching
> from hand-rolled mov loop to rep movsq, while doing 4KB copies.

I've just done some measurements on an i7-7700.
That does have ERMS (fast 'rep movsb') but shows some interesting info.

The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.
(except it has just changed its mind...)
'rep movsb' will copy (about) 32 bytes/clock provided the
destination buffer is 32-byte aligned, but only 16 bytes/clock
otherwise. The source buffer alignment doesn't seem to matter.

On this system 'rep movsq' seems to behave the same way.

So that is faster than a copy loop, which is limited to one
register write per clock.

Test program attached.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

View attachment "memcpy_perf.c" of type "text/plain" (2972 bytes)
