Message-ID: <a4ddb547-ea46-d79d-3088-a97b9a033997@arm.com>
Date:   Fri, 24 Apr 2020 12:00:52 +0100
From:   Robin Murphy <robin.murphy@....com>
To:     David Laight <David.Laight@...LAB.COM>,
        Will Deacon <will@...nel.org>,
        Mark Rutland <mark.rutland@....com>
Cc:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "kernel-team@...roid.com" <kernel-team@...roid.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Segher Boessenkool <segher@...nel.crashing.org>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Luc Van Oostenryck <luc.vanoostenryck@...il.com>,
        Arnd Bergmann <arnd@...db.de>,
        Peter Oberparleiter <oberpar@...ux.ibm.com>,
        Masahiro Yamada <masahiroy@...nel.org>,
        Nick Desaulniers <ndesaulniers@...gle.com>
Subject: Re: [PATCH v4 05/11] arm64: csum: Disable KASAN for do_csum()

On 2020-04-24 10:41 am, David Laight wrote:
> From: Robin Murphy
>> Sent: 22 April 2020 12:02
> ..
>> Sure - I have a nagging feeling that it could still do better WRT
>> pipelining the loads anyway, so I'm happy to come back and reconsider
>> the local codegen later. It certainly doesn't deserve to stand in the
>> way of cross-arch rework.
> 
> How fast does that loop actually run?

I've not characterised it in detail, but faster than any of the other 
attempts so far ;)

> To my mind it seems to do a lot of operations on each 64-bit value.
> I'd have thought that a loop based on:
> 	sum64 = *ptr;
> 	sum64_high = *ptr++ >> 32;
> and then fixing up the result would be faster.
> 
> The x86-64 code is also bad!
> On Intel CPUs prior to Haswell a simple:
> 	sum_64 += *ptr32++;
> is faster than the current code.
> (Although you can do a lot better even on Ivy Bridge.)
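For concreteness, the 32-bit-at-a-time loop you're describing would be
roughly the sketch below (illustrative names only, not code from either
tree):

#include <stdint.h>

/*
 * Sketch of the 32-bit-chunk approach: a 64-bit accumulator can absorb
 * about 2^32 32-bit addends before it can overflow, so no per-iteration
 * carry handling is needed.
 */
static uint64_t csum32_sketch(const uint32_t *ptr32, int len)
{
	uint64_t sum_64 = 0;

	while (len >= 4) {
		sum_64 += *ptr32++;
		len -= 4;
	}
	/* the fold down to 16 bits with end-around carry happens later */
	return sum_64;
}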

The aim here is to minimise load bandwidth - most Arm cores can slurp 16 
bytes from L1 in a single load as quickly as any smaller amount, so 
nibbling away in little 32-bit chunks would result in up to 4x more load 
cycles. Yes, the C code looks ridiculous, but the other trick is that 
most of those operations don't actually exist. Since a __uint128_t is 
really backed by any two 64-bit GPRs - or if you're careful, one 64-bit 
GPR and the carry flag - all those shifts and rotations are in fact 
resolved by register allocation, so what we end up with is a very neat 
loop of essentially just loads and 64-bit accumulation:

...
  138:   a94030c3        ldp     x3, x12, [x6]
  13c:   a9412cc8        ldp     x8, x11, [x6, #16]
  140:   a94228c4        ldp     x4, x10, [x6, #32]
  144:   a94324c7        ldp     x7, x9, [x6, #48]
  148:   ab03018d        adds    x13, x12, x3
  14c:   510100a5        sub     w5, w5, #0x40
  150:   9a0c0063        adc     x3, x3, x12
  154:   ab08016c        adds    x12, x11, x8
  158:   9a0b0108        adc     x8, x8, x11
  15c:   ab04014b        adds    x11, x10, x4
  160:   9a0a0084        adc     x4, x4, x10
  164:   ab07012a        adds    x10, x9, x7
  168:   9a0900e7        adc     x7, x7, x9
  16c:   ab080069        adds    x9, x3, x8
  170:   9a080063        adc     x3, x3, x8
  174:   ab070088        adds    x8, x4, x7
  178:   9a070084        adc     x4, x4, x7
  17c:   910100c6        add     x6, x6, #0x40
  180:   ab040067        adds    x7, x3, x4
  184:   9a040063        adc     x3, x3, x4
  188:   ab010064        adds    x4, x3, x1
  18c:   9a030023        adc     x3, x1, x3
  190:   710100bf        cmp     w5, #0x40
  194:   aa0303e1        mov     x1, x3
  198:   54fffd0c        b.gt    138 <do_csum+0xd8>
...

Instruction-wise, that's about as good as it can get short of 
maintaining multiple accumulators and moving the pairwise folding out of 
the loop. The main thing that I think is still left on the table is that 
the load-to-use distances are pretty short and there's clearly scope to 
spread out and amortise the load cycles better, which stands to benefit 
both big and little cores.
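For reference, here's a minimal sketch of the fold idiom that produces
those adds/adc pairs (illustrative only - the function names are made up
and this is not the actual do_csum() from the patch; it assumes a 16-byte
aligned buffer, a length that's a positive multiple of 16, and a build
where the type-punning load is acceptable, as with the kernel's
-fno-strict-aliasing):

#include <stdint.h>

/*
 * End-around-carry add of the two 64-bit halves of a 128-bit value.
 * The "rotate by 64" is just the two backing GPRs swapped, so on
 * AArch64 this compiles down to a single adds + adc pair.
 */
static inline uint64_t fold128(__uint128_t x)
{
	x += (x >> 64) | (x << 64);
	return x >> 64;
}

/*
 * Illustrative accumulation loop: each iteration is one 16-byte load
 * plus a couple of adds/adc - none of the C-level shifts or rotates
 * survive as instructions in the generated code.
 */
static uint64_t csum_sketch(const void *buff, int len)
{
	const __uint128_t *ptr = buff;
	uint64_t sum = 0;

	do {
		uint64_t chunk = fold128(*ptr++);
		sum = fold128(((__uint128_t)sum << 64) | chunk);
		len -= 16;
	} while (len > 0);

	return sum;
}

The pairwise folding visible in the disassembly above is essentially the
compiler interleaving several such folds across the four ldp results;
keeping independent accumulators and folding after the loop is the
further step mentioned earlier.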

Robin.
