Date:   Mon, 11 Sep 2023 10:37:58 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Mateusz Guzik' <mjguzik@...il.com>
CC:     "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

From: Mateusz Guzik
> Sent: 10 September 2023 11:54
> 
> On 9/3/23, David Laight <David.Laight@...lab.com> wrote:
> > ...
> >> When I was playing with this stuff about 5 years ago I found 32-byte
> >> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> >> Haswell and so on), but only up to a point where rep wins.
> >
> > Does the 'rep movsq' ever actually win?
> > (Unless you find one of the ERMS (or similar) versions.)
> > IIRC it only ever does one iteration per clock - and you
> > should be able to match that with a carefully constructed loop.
> >
> 
> Sorry for late reply, I missed your e-mail due to all the unrelated
> traffic in the thread and using gmail client. ;)
> 
> I am somewhat confused by the question though. In this very patch I'm
> showing numbers from an ERMS-less uarch getting a win from switching
> from hand-rolled mov loop to rep movsq, while doing 4KB copies.

I've just done some measurements on an i7-7700.
That does have ERMS (fast 'rep movsb') but shows some interesting info.

The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.
(except it has just changed its mind...)
'rep movsb' will copy (about) 32 bytes/clock provided the
destination buffer is 32-byte aligned, but only 16 bytes/clock
otherwise. The source buffer alignment doesn't seem to matter.

On this system 'rep movsq' seems to behave the same way.

So that is faster than a copy loop, which is limited to one
register write per clock.

Test program attached.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

View attachment "memcpy_perf.c" of type "text/plain" (2972 bytes)
