Message-ID: <88d43a6e2a7b497b9b1e448bf7ae068a@AcuMS.aculab.com>
Date: Mon, 23 Mar 2020 09:15:45 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Tom Herbert' <tom@...bertland.com>
CC: Eric Dumazet <eric.dumazet@...il.com>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>,
Yadu Kishore <kyk.segfault@...il.com>,
David Miller <davem@...emloft.net>,
Network Development <netdev@...r.kernel.org>
Subject: RE: [PATCH v2] net: Make skb_segment not to compute checksum if
network controller supports checksumming
From: Tom Herbert
> Sent: 22 March 2020 19:54
...
> A while back, I had proposed an optimized x86 checksum function using
> unrolled addq: https://lwn.net/Articles/679137/. Also, this tries to
> optimize for small checksums, like over a header when doing
> skb_postpull_rcsum.
See the patch I submitted a few weeks back (attached); the loop body is:
+ asm( " bt $4, %[len]\n"			/* len is a negative offset from the end: 16-byte chunk left over? */
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"		/* 'add' starts the carry chain */
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"			/* advance; lea leaves CF alone */
+ "10: jecxz 20f\n"				/* count reached 0: exit, flags untouched */
+ " adc (%[buff], %[len]), %[sum_0]\n"		/* 32 bytes/iteration into */
+ " adc 8(%[buff], %[len]), %[sum_1]\n"		/* alternating accumulators */
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"			/* mov leaves CF alone too */
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"			/* fold both accumulators and */
+ " adc %[sum_1], %[sum]\n"			/* the final carry into sum */
+ " adc $0, %[sum]\n"
+ : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+ [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+ : [buff] "r" (buff)
+ : "memory" );
You don't need a heavily unrolled loop - just one that doesn't
have dependencies on parts of the flags register
(in particular, don't require 'dec' to preserve carry;
mixing the two stalls older chips).
You also need to 'adc' into alternating destinations for some chips.
With the fast 'rep movs' on modern CPUs I suspect that trying
to do the copy at the same time as the checksum is a loss.
An 'adc' from memory is one instruction and 2 uops,
fairly easy to execute at 1/clock.
IIRC on Haswell you don't need to unroll the loop at all
(you can use 'dec' and 'adc').
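For illustration, a minimal sketch of that non-unrolled loop (the
function name and the "n >= 1 whole 8-byte words" assumption are mine,
not from the attached patch): 'lea' and 'dec' both leave the carry
flag alone, so on chips without the partial-flags stall the 'adc'
chain survives the loop control:

static unsigned long csum_adc_loop(const unsigned long *buff,
				   unsigned long n)	/* n >= 1 words */
{
	unsigned long sum = 0;

	asm(	"	clc\n"				/* start with CF clear */
		"10:	adc	(%[buff]), %[sum]\n"
		"	lea	8(%[buff]), %[buff]\n"	/* no flags written */
		"	dec	%[n]\n"			/* leaves CF alone */
		"	jnz	10b\n"
		"	adc	$0, %[sum]\n"		/* fold the last carry */
		: [sum] "+r" (sum), [buff] "+r" (buff), [n] "+r" (n)
		:
		: "memory", "cc");

	return sum;
}

The sum_0/sum_1 pair in the attached patch is the 'alternating
destinations' version of the same idea, for the chips that need it.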
A checksum+copy loop is load+adc+store, which is 3 instructions
and 5 uops - that starts hitting the limits on uops/clock
and retirement, so it won't run in 1 clock and the loop
overhead is always extra.
Whereas a 'rep movs' is likely to move 1 cache line/clock.
ISTR 'rep movsq' is faster than 8 bytes/clock on my Ivy Bridge.
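The split version is then trivial - a sketch, with my names and a
"len is a whole number of 8-byte words" shortcut: let memcpy take the
fast 'rep movs' path and checksum the freshly written, L1-hot
destination afterwards:

#include <string.h>

static unsigned long csum_and_copy_split(void *dst, const void *src,
					 unsigned long len)
{
	memcpy(dst, src, len);			/* 'rep movs' class copy */
	return csum_adc_loop(dst, len / 8);	/* sketch from above */
}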
On Haswell you can approach 16 bytes/clock using ad[oc]x.
But it is painful, and I'm not sure that all CPUs that support
those instructions actually execute them quickly.
If you can't use adc (e.g. in C, or on MIPS) there are two
obvious solutions (both sketched in C below):
1) Add 32-bit words to a 64-bit register.
On x86 this is as fast as adc on everything prior to Sandy Bridge!
2) Add 64-bit words to a 64-bit reg and (word >> 32) to a second.
Fixup for the missing carry (check that the high bit of the main
sum matches the relevant bit of the shifted sum).
This could be faster than the horrid arm64 asm code:
1 memory load, 1 add and 1 'shift + add' per 8 bytes.
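A rough C sketch of both (function names are mine; it assumes a whole
number of words, fewer than 2^32 of them, and glosses over byte-order
details - and the fixup recovers the lost carries by subtraction,
which is arithmetic equivalent to the high-bit compare described
above):

#include <stdint.h>
#include <stddef.h>

/* End-around-carry fold: 2^16 == 1 (mod 0xffff), so the high parts
 * of the accumulator can simply be added back in. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffULL) + (sum >> 16);
	sum = (sum & 0xffffULL) + (sum >> 16);
	return (uint16_t)sum;
}

/* 1) 32-bit words into a 64-bit register: the carries pile up in
 *    the high half instead of being lost. */
static uint16_t csum_32in64(const uint32_t *buff, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += buff[i];
	return csum_fold64(sum);
}

/* 2) 64-bit words into one register, (word >> 32) into a second:
 *    1 load, 1 add and 1 'shift + add' per 8 bytes. */
static uint16_t csum_64plus_shift(const uint64_t *buff, size_t nwords)
{
	uint64_t sum = 0, hi = 0, low;
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t w = buff[i];

		sum += w;	/* carries out of bit 63 are lost... */
		hi += w >> 32;	/* ...but recoverable from this sum */
	}
	/* Subtracting hi << 32 cancels the high halves (and, mod 2^64,
	 * the lost carries), leaving the exact sum of the low halves. */
	low = sum - (hi << 32);

	/* total = low + hi * 2^32, and 2^32 == 1 (mod 0xffff). */
	return csum_fold64((low & 0xffffffffULL) + (low >> 32) + hi);
}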
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
[Attachment: "0001-Optimise-x86-IP-checksum-code.patch", 7395 bytes]