Date:   Thu, 14 Sep 2017 11:28:57 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     Josh Poimboeuf <jpoimboe@...hat.com>
Cc:     Eric Biggers <ebiggers3@...il.com>, x86@...nel.org,
        linux-kernel@...r.kernel.org,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Mathias Krause <minipli@...glemail.com>,
        Chandramouli Narayanan <mouli@...ux.intel.com>,
        Jussi Kivilinna <jussi.kivilinna@....fi>,
        Peter Zijlstra <peterz@...radead.org>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        "David S. Miller" <davem@...emloft.net>,
        linux-crypto@...r.kernel.org, Eric Biggers <ebiggers@...gle.com>,
        Andy Lutomirski <luto@...nel.org>, Jiri Slaby <jslaby@...e.cz>
Subject: Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S
 files


* Ingo Molnar <mingo@...nel.org> wrote:

> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
>         vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
> constraint and a few extra cycles per loop iteration if the workload is 
> limited by address-generator bandwidth.
> 
> A simple way to ease that constraint would be to move the 'TBL' load up into the 
> loop body, to the point where 'T1' is used for the last time - which is:
> 
> 
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
> 
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
> 
> +       mov frame_TBL(%rsp), TBL
> 
>         add     y1, h           # h = k + w + h + S0                    # --
> 
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
> 
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
> 
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st 
iteration, via something like:

        movq    $4, frame_SRND(%rsp)

+	mov frame_TBL(%rsp), TBL

.align 16
loop1:
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED
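
For readers who want the shape of this transformation outside the assembly,
here is a minimal C sketch. It is an illustration only, not the kernel code:
'struct frame', 'run_rounds' and 'reg' are hypothetical names, the arithmetic
is a placeholder for the real rounds, and it assumes (as the discussion above
implies) that TBL and T1 share one register, modelled here by the single
variable 'reg'. Saving the advanced pointer back to the frame is likewise an
assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the stack frame slots frame_TBL / frame_SRND. */
struct frame {
    const uint64_t *tbl;   /* spilled table pointer */
    size_t srnd;           /* remaining loop iterations */
};

uint64_t run_rounds(const uint64_t *table, size_t rounds)
{
    struct frame f = { .tbl = table, .srnd = rounds };
    uint64_t h = 0;
    uintptr_t reg;                     /* one register shared by TBL and T1 */

    /* Prime the register before the first iteration, since the loop
     * body no longer reloads it at the top. */
    reg = (uintptr_t)f.tbl;

    while (f.srnd != 0) {
        /* Top of loop: reg is used as TBL immediately. */
        uint64_t xfer = *(const uint64_t *)reg;
        f.tbl = (const uint64_t *)reg + 1;   /* spill the advanced pointer */

        /* reg is now reused as scratch (T1); this is its last such use. */
        reg = (uintptr_t)(xfer & 0xff);
        h += reg;

        /* Early reload: issued right after the last T1 use, so the work
         * below can overlap with the load instead of waiting on it. */
        reg = (uintptr_t)f.tbl;

        h += xfer >> 8;                /* remaining work that does not touch reg */
        f.srnd--;
    }
    return h;
}
```

The point is only the ordering: one load before the loop, and the in-loop
reload placed directly after the last scratch use rather than at the loop top.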

Thanks,

	Ingo
