Message-ID: <20170914092857.mvarp7iok6jf43sn@gmail.com>
Date: Thu, 14 Sep 2017 11:28:57 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Josh Poimboeuf <jpoimboe@...hat.com>
Cc: Eric Biggers <ebiggers3@...il.com>, x86@...nel.org,
	linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...ux.intel.com>,
	Mathias Krause <minipli@...glemail.com>,
	Chandramouli Narayanan <mouli@...ux.intel.com>,
	Jussi Kivilinna <jussi.kivilinna@....fi>,
	Peter Zijlstra <peterz@...radead.org>,
	Herbert Xu <herbert@...dor.apana.org.au>,
	"David S. Miller" <davem@...emloft.net>,
	linux-crypto@...r.kernel.org, Eric Biggers <ebiggers@...gle.com>,
	Andy Lutomirski <luto@...nel.org>, Jiri Slaby <jslaby@...e.cz>
Subject: Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files


* Ingo Molnar <mingo@...nel.org> wrote:

> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
>         vpaddq  (TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new
> constraint and a few extra cycles per loop iteration if the workload is
> address-generator bandwidth limited on that.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into the
> loop body, to the point where 'T1' is used for the last time - which is:
>
>
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
>
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = ((a|c)&b)|(a&c)            # MAJ
>
> +       mov     frame_TBL(%rsp), TBL
>
>         add     y1, h           # h = k + w + h + S0                    # --
>
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
>
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0 # --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
>
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st
iteration, via something like:

        movq    $4, frame_SRND(%rsp)

+       mov     frame_TBL(%rsp), TBL

.align 16
loop1:
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Thanks,

	Ingo
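
For illustration only, the reordering discussed above can be sketched at the C level. This is an analogy under stated assumptions, not the kernel code: 'struct frame', 'rounds_reload_at_top()' and 'rounds_reload_hoisted()' are hypothetical names, the arithmetic is a stand-in for the real rounds, and a C compiler is free to reschedule the loads anyway. The point is only the program order: the hoisted variant needs one initial load before the first iteration, and then reloads the pointer late in each iteration, well ahead of its next use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stack-frame slot holding the saved table pointer;
 * it stands in for frame_TBL(%rsp). */
struct frame {
    const uint64_t *tbl;
};

/* Baseline: reload the pointer at the top of each iteration and
 * dereference it in the very next step. */
uint64_t rounds_reload_at_top(struct frame *f, size_t n)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        const uint64_t *tbl = f->tbl;   /* reload ...                  */
        uint64_t k = *tbl;              /* ... and use it immediately  */

        f->tbl = tbl + 1;               /* save the advanced pointer   */
        acc += k;                       /* stand-in round work, part 1 */
        acc = (acc << 7) | (acc >> 57); /* stand-in round work, part 2 */
    }
    return acc;
}

/* Hoisted: one initial load before the loop, then the reload for the
 * next iteration is issued before the tail of the current one. */
uint64_t rounds_reload_hoisted(struct frame *f, size_t n)
{
    uint64_t acc = 0;
    const uint64_t *tbl = f->tbl;       /* initialize before 1st iteration */

    for (size_t i = 0; i < n; i++) {
        uint64_t k = *tbl;              /* pointer is already available */

        f->tbl = tbl + 1;               /* save the advanced pointer    */
        acc += k;                       /* stand-in round work, part 1  */
        tbl = f->tbl;                   /* early reload for next round  */
        acc = (acc << 7) | (acc >> 57); /* stand-in round work, part 2  */
    }
    return acc;
}
```

Both variants compute the same result and leave the saved pointer in the same place; only the distance between the reload and its first use differs.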