Message-ID: <88d43a6e2a7b497b9b1e448bf7ae068a@AcuMS.aculab.com>
Date:   Mon, 23 Mar 2020 09:15:45 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Tom Herbert' <tom@...bertland.com>
CC:     Eric Dumazet <eric.dumazet@...il.com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Yadu Kishore <kyk.segfault@...il.com>,
        David Miller <davem@...emloft.net>,
        Network Development <netdev@...r.kernel.org>
Subject: RE: [PATCH v2] net: Make skb_segment not to compute checksum if
 network controller supports checksumming

From: Tom Herbert
> Sent: 22 March 2020 19:54
...
> A while back, I had proposed an optimized x86 checksum function using
> unrolled addq a while back https://lwn.net/Articles/679137/. Also,
> this tries to optimize from small checksum like over header when doing
> skb_postpull_rcsum.

Look for the patch I submitted a few weeks back.
It is attached; the loop body is:

+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

You don't need a heavily unrolled loop - just a loop that doesn't
have dependencies on parts of the flags register
(in particular, don't require 'dec' to preserve carry).
You also need to adc to alternating destinations on some chips.

With the fast 'rep movs' on modern CPUs I suspect that trying
to do the copy at the same time as the checksum is a loss.

The 'adc memory' is one instruction and 2 uops.
It is fairly easy to execute one per clock.
IIRC on Haswell you don't need to unroll the loop at all.
(You can use dec and adc).
A checksum+copy loop is load+adc+store which is 3 instructions
and 5 uops - it starts hitting the limits on uops/clock
and retirement, so it won't run in 1 clock, and the loop
overhead will always be extra.
Whereas a 'rep movs' is likely to move 1 cache line/clock.
ISTR 'rep movsq' is faster than 8 bytes/clock on my 'ivy bridge'.

On Haswell you can approach 16 bytes/clock using ad[oc]x.
But it is painful, and I'm not sure that all CPUs that support
the instructions actually execute them quickly.

If you can't use adc (e.g. in C or on MIPS) there are two
obvious solutions:

1) Add 32bit words to a 64bit register.
   On x86 this is as fast as adc on everything prior to sandy bridge!

2) Add 64bit words to a 64bit reg and (word >> 32) to a second.
   Fixup for the missing carry (check high bit of main sum matches
   relevant bit of shifted sum).
   This could be faster than the horrid arm64 asm code.
   1 memory load, 1 add, 1 'shift + add' per 8 bytes.
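
A minimal C sketch of the two options, for illustration only. The
function names (fold64, csum_ref16, csum_add32, csum_add64) are
hypothetical and are not taken from the attached patch. The sketch
assumes the length is a multiple of 8 bytes and that fewer than 2^32
words are summed, and it returns the folded, uninverted sum in native
byte order. The fixup in csum_add64 is one possible reading of option 2:
instead of comparing individual bits, it recovers the exact sum of the
low halves at the end, which may not be the check meant above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit accumulator to 16 bits with end-around carry. */
static uint16_t fold64(uint64_t s)
{
	s = (s & 0xffffffffu) + (s >> 32);
	s = (s & 0xffffffffu) + (s >> 32);
	s = (s & 0xffffu) + (s >> 16);
	s = (s & 0xffffu) + (s >> 16);
	return (uint16_t)s;
}

/* Reference: add 16-bit words one at a time. */
static uint16_t csum_ref16(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 2 <= len; i += 2) {
		uint16_t w;

		memcpy(&w, buf + i, sizeof(w));
		sum += w;
	}
	return fold64(sum);
}

/*
 * Option 1: add 32-bit words to a 64-bit accumulator.
 * The carries collect in the high half, so nothing is lost.
 */
static uint16_t csum_add32(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		uint32_t w;

		memcpy(&w, buf + i, sizeof(w));
		sum += w;
	}
	return fold64(sum);
}

/*
 * Option 2: add 64-bit words to 'sum' (which may wrap) and
 * (word >> 32) to 'sum_hi' (which is exact). The sum of the low
 * halves is below 2^64, so it equals sum - (sum_hi << 32) computed
 * modulo 2^64. Since 2^32 == 1 (mod 0xffff), the checksum is the
 * end-around sum of the two parts.
 */
static uint16_t csum_add64(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0, sum_hi = 0, sum_lo;
	size_t i;

	for (i = 0; i + 8 <= len; i += 8) {
		uint64_t w;

		memcpy(&w, buf + i, sizeof(w));
		sum += w;
		sum_hi += w >> 32;
	}
	sum_lo = sum - (sum_hi << 32);
	return fold64((uint64_t)fold64(sum_hi) + fold64(sum_lo));
}
```

Both variants should agree with csum_ref16() on any buffer that meets
the assumptions above; tail bytes and odd alignment would need extra
handling that is left out here.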

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Download attachment "0001-Optimise-x86-IP-checksum-code.patch" of type "application/octet-stream" (7395 bytes)
