linux-kernel - RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not multiples of 8 bytes.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <3107b1e365f34df080feefb68be8a422@AcuMS.aculab.com>
Date:   Tue, 14 Dec 2021 12:36:07 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     David Laight <David.Laight@...LAB.COM>,
        'Noah Goldstein' <goldstein.w.n@...il.com>,
        'Eric Dumazet' <edumazet@...gle.com>
CC:     "'tglx@...utronix.de'" <tglx@...utronix.de>,
        "'mingo@...hat.com'" <mingo@...hat.com>,
        'Borislav Petkov' <bp@...en8.de>,
        "'dave.hansen@...ux.intel.com'" <dave.hansen@...ux.intel.com>,
        'X86 ML' <x86@...nel.org>, "'hpa@...or.com'" <hpa@...or.com>,
        "'peterz@...radead.org'" <peterz@...radead.org>,
        "'alexanderduyck@...com'" <alexanderduyck@...com>,
        'open list' <linux-kernel@...r.kernel.org>,
        'netdev' <netdev@...r.kernel.org>
Subject: RE: [PATCH] lib/x86: Optimise csum_partial of buffers that are not
 multiples of 8 bytes.

From: David Laight <David.Laight@...LAB.COM>
> Sent: 13 December 2021 18:01
> 
> Add in the trailing bytes first so that there is no need to worry
> about the sum exceeding 64 bits.

This is an alternate version that (mostly) compiles to reasonable code.
I've also booted a kernel with it - networking still works!

https://godbolt.org/z/K6vY31Gqs

I changed the while (len >= 64) loop into an
if (len >= 64) do (...) while(len >= 64) one.
But gcc makes a pigs breakfast of compiling it - it optimises
it so that it is while (ptr < lim) but adds a lot of code.
So I've done that by hand.
Then it still makes a meal of it because it refuses to take
'buff' from the final loop iteration.
An assignment to the limit helps.

Then there is the calculation of (8 - (len & 7)) * 8.
gcc prior to 9.2 just negate (len & 7) then use leal 56(,%rs1,8),%rcx.
But later ones and fail to notice.
Even given (64 + 8 * -(len & 7)) clang fails to use leal.

I'm not even sure the code clang generates is right:
(%rsi is (len & 7))
        movq    -8(%rsi,%rax), %rdx
        leal    (,%rsi,8), %ecx
        andb    $56, %cl
        negb    %cl
        shrq    %cl, %rdx

The 'negb' is the wrong size of the 'andb'.
It might be ok if it is assuming the cpu ignores the high 2 bits of %cl.
But that is a horrid assumption to be making.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)