linux-kernel - Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <dfb5f2e0-027d-2b9c-aec7-313ff0275381@intel.com>
Date:   Fri, 15 Jan 2021 16:14:40 -0800
From:   "Dey, Megha" <megha.dey@...el.com>
To:     Ard Biesheuvel <ardb@...nel.org>
Cc:     Herbert Xu <herbert@...dor.apana.org.au>,
        "David S. Miller" <davem@...emloft.net>,
        Linux Crypto Mailing List <linux-crypto@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        ravi.v.shankar@...el.com, tim.c.chen@...el.com,
        andi.kleen@...el.com, dave.hansen@...el.com,
        wajdi.k.feghali@...el.com, greg.b.tucker@...el.com,
        robert.a.kasten@...el.com, rajendrakumar.chinnaiyan@...el.com,
        tomasz.kantecki@...el.com, ryan.d.saffores@...el.com,
        ilya.albrekht@...el.com, kyung.min.park@...el.com,
        Tony Luck <tony.luck@...el.com>, ira.weiny@...el.com
Subject: Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@...el.com> wrote:
>> From: Kyung Min Park <kyung.min.park@...el.com>
>>
>> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
>> The new instruction allows to work on 4 x 16 byte blocks at the time.
>> For best parallelism and deeper out of order execution, the main loop of
>> the code works on 16 x 16 byte blocks at the time and performs reduction
>> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
>> and the precompute operation has been optimized as well to leverage 512 bit
>> registers, parallel carry less multiply and reduction.
>>
>> VPCLMULQDQ instruction is used to accelerate the most time-consuming
>> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
>> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>>
>> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
>> version with the VPCLMULQDQ version when the following criteria are met:
>> At compile time:
>> 1. CONFIG_CRYPTO_AVX512 is enabled
>> 2. toolchain(assembler) supports VPCLMULQDQ instructions
>> At runtime:
>> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>>     only Icelake)
>> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>>     boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>>     to 1 after boot.
>>     If compiled as loadable module, use_avx512 module parameter must be set:
>>     modprobe ghash_clmulni_intel use_avx512=1
>>
>> With new implementation, tcrypt ghash speed test shows about 4x to 10x
>> speedup improvement for GHASH calculation compared to the original
>> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
>> or above. Detailed results for a variety of block sizes and update
>> sizes are in the table below. The test was performed on Icelake based
>> platform with constant frequency set for CPU.
>>
>> The average performance improvement of the AVX512 version over the current
>> implementation is as follows:
>> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
>> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>>
>> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
>> and VPCLMULQDQ instruction shows the following results.
>>
>> ---------------------------------------------------------------------------
>> |            |            |         cycles/operation         |            |
>> |            |            |       (the lower the better)     |            |
>> |    byte    |   bytes    |----------------------------------| percentage |
>> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
>> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
>> |------------|------------|----------------|-----------------|------------|
>> |      16    |     16     |       144      |        233      |   -38.0    |
>> |      64    |     16     |       535      |        709      |   -24.5    |
>> |      64    |     64     |       210      |        146      |    43.8    |
>> |     256    |     16     |      1808      |       1911      |    -5.4    |
>> |     256    |     64     |       865      |        581      |    48.9    |
>> |     256    |    256     |       682      |        170      |   301.0    |
>> |    1024    |     16     |      6746      |       6935      |    -2.7    |
>> |    1024    |    256     |      2829      |        714      |   296.0    |
>> |    1024    |   1024     |      2543      |        341      |   645.0    |
>> |    2048    |     16     |     13219      |      13403      |    -1.3    |
>> |    2048    |    256     |      5435      |       1408      |   286.0    |
>> |    2048    |   1024     |      5218      |        685      |   661.0    |
>> |    2048    |   2048     |      5061      |        565      |   796.0    |
>> |    4096    |     16     |     40793      |      27615      |    47.8    |
>> |    4096    |    256     |     10662      |       2689      |   297.0    |
>> |    4096    |   1024     |     10196      |       1333      |   665.0    |
>> |    4096    |   4096     |     10049      |       1011      |   894.0    |
>> |    8192    |     16     |     51672      |      54599      |    -5.3    |
>> |    8192    |    256     |     21228      |       5284      |   301.0    |
>> |    8192    |   1024     |     20306      |       2556      |   694.0    |
>> |    8192    |   4096     |     20076      |       2044      |   882.0    |
>> |    8192    |   8192     |     20071      |       2017      |   895.0    |
>> ---------------------------------------------------------------------------
>>
>> This work was inspired by the AES GCM mode optimization published
>> in Intel Optimized IPSEC Cryptographic library.
>> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>>
>> Co-developed-by: Greg Tucker <greg.b.tucker@...el.com>
>> Signed-off-by: Greg Tucker <greg.b.tucker@...el.com>
>> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@...el.com>
>> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@...el.com>
>> Signed-off-by: Kyung Min Park <kyung.min.park@...el.com>
>> Co-developed-by: Megha Dey <megha.dey@...el.com>
>> Signed-off-by: Megha Dey <megha.dey@...el.com>
> Hello Megha,
>
> What is the purpose of this separate GHASH module? GHASH is only used
> in combination with AES-CTR to produce GCM, and this series already
> contains a GCM driver.
>
> Do cores exist that implement PCLMULQDQ but not AES-NI?
>
> If not, I think we should be able to drop this patch (and remove the
> existing PCLMULQDQ GHASH driver as well)

AFAIK, dm-verity (authenticated but not encrypted file system) is one 
use case for authentication only.

Although I am not sure if GHASH is specifically used for this or SHA?

Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

Megha