Message-ID: <CAMj1kXGhGopfg19at5N_9q89-UA4irSgMULyDXg+dKhnbRrCZQ@mail.gmail.com>
Date: Sat, 19 Dec 2020 18:03:45 +0100
From: Ard Biesheuvel <ardb@...nel.org>
To: Megha Dey <megha.dey@...el.com>
Cc: Herbert Xu <herbert@...dor.apana.org.au>,
"David S. Miller" <davem@...emloft.net>,
Linux Crypto Mailing List <linux-crypto@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
ravi.v.shankar@...el.com, tim.c.chen@...el.com,
andi.kleen@...el.com, dave.hansen@...el.com,
wajdi.k.feghali@...el.com, greg.b.tucker@...el.com,
robert.a.kasten@...el.com, rajendrakumar.chinnaiyan@...el.com,
tomasz.kantecki@...el.com, ryan.d.saffores@...el.com,
ilya.albrekht@...el.com, kyung.min.park@...el.com,
Tony Luck <tony.luck@...el.com>, ira.weiny@...el.com
Subject: Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@...el.com> wrote:
>
> From: Kyung Min Park <kyung.min.park@...el.com>
>
> Optimize GHASH computations with the 512-bit wide VPCLMULQDQ instructions.
> The new instruction allows working on 4 x 16-byte blocks at a time.
> For best parallelism and deeper out-of-order execution, the main loop of
> the code works on 16 x 16-byte blocks at a time and performs a reduction
> every 48 x 16-byte blocks. Such an approach needs 48 precomputed GHASH
> subkeys, and the precompute operation has been optimized as well to
> leverage 512-bit registers, parallel carry-less multiplication, and
> reduction.
>
> The VPCLMULQDQ instruction is used to accelerate the most time-consuming
> part of GHASH, carry-less multiplication. VPCLMULQDQ with AVX-512F adds
> an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.
>
> The glue code in the ghash_clmulni_intel module overrides the existing
> PCLMULQDQ version with the VPCLMULQDQ version when the following criteria
> are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. The toolchain (assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on the platform
>    (currently only Icelake)
> 2. If compiled as a built-in module, ghash_clmulni_intel.use_avx512 is set
>    at boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512
>    is set to 1 after boot.
>    If compiled as a loadable module, the use_avx512 module parameter must
>    be set:
>    modprobe ghash_clmulni_intel use_avx512=1
>
> With the new implementation, the tcrypt ghash speed test shows about a 4x
> to 10x speedup for GHASH calculation compared to the original
> implementation with PCLMULQDQ when the bytes-per-update size is 256 bytes
> or above. Detailed results for a variety of block sizes and update
> sizes are in the table below. The test was performed on an Icelake-based
> platform with the CPU frequency held constant.
>
> The average performance improvement of the AVX512 version over the current
> implementation is as follows:
> For bytes per update >= 1KB, we see an average improvement of 882% (~8.8x).
> For bytes per update < 1KB, we see an average improvement of 370% (~3.7x).
>
> A typical tcrypt run of the GHASH calculation with the PCLMULQDQ and
> VPCLMULQDQ instructions shows the following results.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |      (the lower the better)      |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |      16    |       144      |       233       |    -38.0   |
> |      64    |      16    |       535      |       709       |    -24.5   |
> |      64    |      64    |       210      |       146       |     43.8   |
> |     256    |      16    |      1808      |      1911       |     -5.4   |
> |     256    |      64    |       865      |       581       |     48.9   |
> |     256    |     256    |       682      |       170       |    301.0   |
> |    1024    |      16    |      6746      |      6935       |     -2.7   |
> |    1024    |     256    |      2829      |       714       |    296.0   |
> |    1024    |    1024    |      2543      |       341       |    645.0   |
> |    2048    |      16    |     13219      |     13403       |     -1.3   |
> |    2048    |     256    |      5435      |      1408       |    286.0   |
> |    2048    |    1024    |      5218      |       685       |    661.0   |
> |    2048    |    2048    |      5061      |       565       |    796.0   |
> |    4096    |      16    |     40793      |     27615       |     47.8   |
> |    4096    |     256    |     10662      |      2689       |    297.0   |
> |    4096    |    1024    |     10196      |      1333       |    665.0   |
> |    4096    |    4096    |     10049      |      1011       |    894.0   |
> |    8192    |      16    |     51672      |     54599       |     -5.3   |
> |    8192    |     256    |     21228      |      5284       |    301.0   |
> |    8192    |    1024    |     20306      |      2556       |    694.0   |
> |    8192    |    4096    |     20076      |      2044       |    882.0   |
> |    8192    |    8192    |     20071      |      2017       |    895.0   |
> ---------------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published
> in the Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@...el.com>
> Signed-off-by: Greg Tucker <greg.b.tucker@...el.com>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@...el.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@...el.com>
> Signed-off-by: Kyung Min Park <kyung.min.park@...el.com>
> Co-developed-by: Megha Dey <megha.dey@...el.com>
> Signed-off-by: Megha Dey <megha.dey@...el.com>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well).
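
For anyone who wants to check a given machine for the PCLMULQDQ-without-AES-NI
case, below is a minimal userspace sketch (not kernel code, and not part of
this series). It relies on CPUID leaf 1 reporting PCLMULQDQ in ECX bit 1 and
AES-NI in ECX bit 25. The helper names (has_pclmulqdq, has_aesni,
clmul_without_aesni, read_cpuid1_ecx) are hypothetical and exist only for
illustration; __get_cpuid() is the GCC/Clang helper from <cpuid.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=1):ECX feature bits. */
#define CPUID1_ECX_PCLMULQDQ (1u << 1)
#define CPUID1_ECX_AESNI     (1u << 25)

/* Hypothetical helpers: decode a raw CPUID leaf-1 ECX value. */
static bool has_pclmulqdq(uint32_t ecx)
{
    return (ecx & CPUID1_ECX_PCLMULQDQ) != 0;
}

static bool has_aesni(uint32_t ecx)
{
    return (ecx & CPUID1_ECX_AESNI) != 0;
}

/* True only for the case in question: PCLMULQDQ present, AES-NI absent. */
static bool clmul_without_aesni(uint32_t ecx)
{
    return has_pclmulqdq(ecx) && !has_aesni(ecx);
}

/*
 * Read CPUID leaf-1 ECX from the running CPU.
 * Returns false on non-x86 builds or if leaf 1 is not available.
 */
static bool read_cpuid1_ecx(uint32_t *ecx)
{
    assert(ecx != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a, b, c, d;

    if (!__get_cpuid(1, &a, &b, &c, &d))
        return false;
    *ecx = c;
    return true;
#else
    return false;
#endif
}

/* Print what the running CPU reports for the two features. */
static void print_clmul_aesni_report(void)
{
    uint32_t ecx;

    if (!read_cpuid1_ecx(&ecx)) {
        puts("CPUID leaf 1 not available");
        return;
    }
    printf("PCLMULQDQ: %d, AES-NI: %d, PCLMULQDQ without AES-NI: %d\n",
           has_pclmulqdq(ecx), has_aesni(ecx), clmul_without_aesni(ecx));
}
```

Calling print_clmul_aesni_report() from main() reports only on the machine it
runs on. It says nothing about whether such cores exist elsewhere, and a
hypervisor may mask either bit.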