Message-ID: <9a5dd401bf154a0aace0e5f781a3580c@AcuMS.aculab.com>
Date:   Sun, 3 Sep 2023 20:42:55 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Mateusz Guzik' <mjguzik@...il.com>
CC:     "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>
Subject: RE: [PATCH v2] x86: bring back rep movsq for user access on CPUs
 without ERMS

...
> When I was playing with this stuff about 5 years ago I found 32-byte
> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> Haswell and so on), but only up to a point where rep wins.

Does 'rep movsq' ever actually win?
(Unless you find one of the ERMS (or similar) versions.)
IIRC it only ever does one iteration per clock - and you
should be able to match that with a carefully constructed loop.
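
For illustration, a minimal sketch of the kind of loop meant here - not
the kernel's copy code, and the name is made up; it assumes len is a
multiple of 32, 8-byte alignment, and non-overlapping regions:

        #include <stddef.h>
        #include <stdint.h>

        /* Copy 32 bytes per iteration with plain 64-bit moves.  Four
         * independent loads followed by four stores give the out-of-order
         * core room to sustain roughly one 8-byte move per clock. */
        static void copy_32b_loop(void *dst, const void *src, size_t len)
        {
                uint64_t *d = dst;
                const uint64_t *s = src;

                for (; len >= 32; len -= 32) {
                        uint64_t a = s[0], b = s[1], c = s[2], e = s[3];
                        d[0] = a; d[1] = b; d[2] = c; d[3] = e;
                        s += 4;
                        d += 4;
                }
        }

Whether the compiler preserves that shape is of course its own question.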

Many years ago I got my Athlon-700 to execute a copy loop
as fast as 'rep movs' - but the setup times were longer.

The killer for 'rep movs' setup was always P4-Netburst - over 40 clocks.
But I think some of the more recent CPUs are still in double figures
(apart from some optimised copies).
So I'm not actually sure you should ever need to switch
to 'rep movsq' - but I've not tried to write such a loop.

I did have to unroll the ip-cksum loop 4 times, like this:
+       asm(    "       bt    $4, %[len]\n"   // odd 16-byte block first?
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"   // lea leaves the carry flag alone
+               "10:    jecxz 20f\n"   // %[len] is %rcx
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"   // advance early, into a temp
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"   // mov doesn't touch flags either
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"   // fold the partial sums
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"   // and the final carry
That is what it takes to get one adc every clock - but only because of
the strange loop structure needed to make the carry flag loop-carried
(the 'loop' instruction is OK on AMD CPUs, but not on Intel ones.)
A similar loop using adox and adcx will beat one read per clock
provided it is unrolled further.
(IIRC I got to about 12 bytes/clock.)
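
As a rough sketch of that adcx/adox shape (not the code I measured -
hypothetical names, and it assumes an ADX-capable CPU, len a multiple
of 32, and the same single-adc carry fold as above):

        #include <stddef.h>
        #include <stdint.h>

        static uint64_t csum_adx(const void *buff, size_t len)
        {
                uint64_t sum0 = 0, sum1 = 0, of_bit;
                const char *end = (const char *)buff + len;
                long neg = -(long)len;   // negative offset, counts up to 0

                asm(    "       xor   %[of], %[of]\n"   // clears CF and OF
                        "10:    jrcxz 20f\n"   // %[neg] is %rcx
                        "       adcx  (%[end], %[neg]), %[sum0]\n"    // CF chain
                        "       adox  8(%[end], %[neg]), %[sum1]\n"   // OF chain
                        "       adcx  16(%[end], %[neg]), %[sum0]\n"
                        "       adox  24(%[end], %[neg]), %[sum1]\n"
                        "       lea   32(%[neg]), %[neg]\n"   // leaves both flags intact
                        "       jmp   10b\n"
                        "20:    seto  %b[of]\n"   // capture the OF chain's carry
                        "       adc   %[sum1], %[sum0]\n"   // fold in the CF chain's carry
                        "       adc   %[of], %[sum0]\n"
                        "       adc   $0, %[sum0]\n"
                        : [sum0] "+&r" (sum0), [sum1] "+&r" (sum1),
                          [neg] "+c" (neg), [of] "=&r" (of_bit)
                        : [end] "r" (end)
                        : "cc", "memory");

                return sum0;
        }

Unrolling the body again is what gets past one read per clock.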

	David

