linux-kernel - RE: [RFC] Improve memset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2061267c74254b03ad5b7b23d6dfd961@AcuMS.aculab.com>
Date:   Tue, 17 Sep 2019 10:55:12 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Linus Torvalds' <torvalds@...ux-foundation.org>,
        Rasmus Villemoes <linux@...musvillemoes.dk>
CC:     Borislav Petkov <bp@...en8.de>,
        Rasmus Villemoes <mail@...musvillemoes.dk>,
        x86-ml <x86@...nel.org>, Andy Lutomirski <luto@...nel.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        lkml <linux-kernel@...r.kernel.org>
Subject: RE: [RFC] Improve memset

From: Linus Torvalds
> Sent: 16 September 2019 18:25
...
> You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
> code for specific cases if you don't look at I$ footprint and the cost
> of the AVX setup (and the cost of frequency changes, which often go
> hand-in-hand with the AVX use). So "rep movs/stos" is seldom
> _optimal_, but it tends to be "quite good" for modern CPU's with
> variable sizes that are in the 100+ byte range.

Years ago I managed to match 'rep movs' on my Athlon 700 with a
'normal' code loop.
I can't remember whether I beat the setup time though.
The 'trick' was to do 'read (read-write)*n write' to
avoid stalls and get all the loop processing for free.
The I$ footprint was larger though.

The setup costs for 'rep movx' are significant.
I think the worst cpu was the P4-Netbust at 40-50 clocks.
My guess is over 10 for all cpu (except pre-pentium ones).

IIRC the only cpu on which you should use 'rep movsb' for the
trailing bytes is the one before Intel added the fast copy logic.
That one had special optimisations for 'rep movsb' of length < 8.

Remember, if you are inlining you probably have to assume cold-cache
and an untrained branch predictor.
Most benchmarking is done hot-cache with the branch predictor trained.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)