lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2061267c74254b03ad5b7b23d6dfd961@AcuMS.aculab.com>
Date:   Tue, 17 Sep 2019 10:55:12 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Linus Torvalds' <torvalds@...ux-foundation.org>,
        Rasmus Villemoes <linux@...musvillemoes.dk>
CC:     Borislav Petkov <bp@...en8.de>,
        Rasmus Villemoes <mail@...musvillemoes.dk>,
        x86-ml <x86@...nel.org>, Andy Lutomirski <luto@...nel.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        lkml <linux-kernel@...r.kernel.org>
Subject: RE: [RFC] Improve memset

From: Linus Torvalds
> Sent: 16 September 2019 18:25
...
> You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
> code for specific cases if you don't look at I$ footprint and the cost
> of the AVX setup (and the cost of frequency changes, which often go
> hand-in-hand with the AVX use). So "rep movs/stos" is seldom
> _optimal_, but it tends to be "quite good" for modern CPU's with
> variable sizes that are in the 100+ byte range.

Years ago I managed to match 'rep movs' on my Athlon 700 with a
'normal' code loop.
I can't remember whether I beat the setup time though.
The 'trick' was to do 'read (read-write)*n write' to
avoid stalls and get all the loop processing for free.
The I$ footprint was larger though.

The setup costs for 'rep movx' are significant.
I think the worst cpu was the P4-Netbust at 40-50 clocks.
My guess is over 10 for all cpu (except pre-pentium ones).

IIRC the only cpu on which you should use 'rep movsb' for the
trailing bytes is the one before Intel added the fast copy logic.
That one had special optimisations for 'rep movsb' of length < 8.

Remember, if you are inlining you probably have to assume cold-cache
and an untrained branch predictor.
Most benchmarking is done hot-cache with the branch predictor trained.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ