Message-ID: <CAMj1kXGr38moMGUj-p1m3u=N=RZ7XKfnCfPebd1+rTGgwmJfKQ@mail.gmail.com>
Date: Sat, 8 Feb 2025 10:07:20 +0100
From: Ard Biesheuvel <ardb@...nel.org>
To: Eric Biggers <ebiggers@...nel.org>
Cc: linux-crypto@...r.kernel.org, x86@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] crypto: x86/aes-ctr - rewrite AESNI+AVX optimized CTR
and add VAES support

Hi Eric,

On Sat, 8 Feb 2025 at 04:52, Eric Biggers <ebiggers@...nel.org> wrote:
>
> From: Eric Biggers <ebiggers@...gle.com>
>
> Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
> aes-ctr-avx-x86_64.S which follows a similar approach to
> aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
> VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
> AESNI+AVX. Wire it up to the crypto API accordingly.
>
> This greatly improves the performance of AES-CTR and AES-XCTR on
> VAES-capable CPUs, with the best case being AMD Zen 5, where throughput
> increases by over 230% on long messages. Performance on
> non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR
> code (aesni_ctr_enc) is also kept as-is for now. There are some slight
> regressions (less than 10%) at some short message lengths on some CPUs;
> these are difficult to avoid, given how heavily the previous code was
> unrolled by message length, and they are not particularly important.
> Detailed performance results are given in the tables below.
>
> Both CTR and XCTR support is retained. The main loop remains
> 8-vector-wide, which differs from the 4-vector-wide main loops that are
> used in the XTS and GCM code. A wider loop is appropriate for CTR and
> XCTR since they have fewer other instructions (such as vpclmulqdq) to
> interleave with the AES instructions.
>
> Similar to what was the case for AES-GCM, the new assembly code also has
> a much smaller binary size, as it fixes the excessive unrolling by data
> length and key length present in the old code. Specifically, the new
> assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
> This is despite 4x as many implementations being included.
>
> The tables below give the detailed performance results: the percentage
> improvement in single-threaded throughput for repeated encryption of
> the given message length, where an increase from 6000 MB/s to
> 12000 MB/s would be listed as 100%. The results were collected by
> directly measuring the Linux crypto API performance using a custom
> kernel module. The tested CPUs were all server processors from Google
> Compute Engine, except for Zen 5, which was a Ryzen 9 9950X desktop
> processor.
>
> Table 1: AES-256-CTR throughput improvement,
> CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
> Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
> Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
> AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
> AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
> AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
> Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
> Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
> Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
> AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
> AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
> AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
> Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |
>
> Table 2: AES-256-XCTR throughput improvement,
> CPU microarchitecture vs. message length in bytes:
>
>                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
> ---------------------+-------+-------+-------+-------+-------+-------+
> AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
> Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
> Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
> AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
> AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
> AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
> Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |
>
>                      |   300 |   200 |    64 |    63 |    16 |
> ---------------------+-------+-------+-------+-------+-------+
> AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
> Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
> Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
> AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
> AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
> AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
> Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |
>
> Signed-off-by: Eric Biggers <ebiggers@...gle.com>
> ---
>
Very nice results! One remark below.
...
> diff --git a/arch/x86/crypto/aes-ctr-avx-x86_64.S b/arch/x86/crypto/aes-ctr-avx-x86_64.S
> new file mode 100644
> index 0000000000000..25cab1d8e63f9
> --- /dev/null
> +++ b/arch/x86/crypto/aes-ctr-avx-x86_64.S
> @@ -0,0 +1,592 @@
> +/* SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause */
> +//
> +// Copyright 2025 Google LLC
> +//
> +// Author: Eric Biggers <ebiggers@...gle.com>
> +//
> +// This file is dual-licensed, meaning that you can use it under your choice of
> +// either of the following two licenses:
> +//
> +// Licensed under the Apache License 2.0 (the "License"). You may obtain a copy
> +// of the License at
> +//
> +// http://www.apache.org/licenses/LICENSE-2.0
> +//
> +// Unless required by applicable law or agreed to in writing, software
> +// distributed under the License is distributed on an "AS IS" BASIS,
> +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> +// See the License for the specific language governing permissions and
> +// limitations under the License.
> +//
> +// or
> +//
> +// Redistribution and use in source and binary forms, with or without
> +// modification, are permitted provided that the following conditions are met:
> +//
> +// 1. Redistributions of source code must retain the above copyright notice,
> +// this list of conditions and the following disclaimer.
> +//
> +// 2. Redistributions in binary form must reproduce the above copyright
> +// notice, this list of conditions and the following disclaimer in the
> +// documentation and/or other materials provided with the distribution.
> +//
> +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
> +// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> +// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
> +// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> +// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> +// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> +// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
> +// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> +// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> +// POSSIBILITY OF SUCH DAMAGE.
> +//
> +//------------------------------------------------------------------------------
> +//
> +// This file contains x86_64 assembly implementations of AES-CTR and AES-XCTR
> +// using the following sets of CPU features:
> +// - AES-NI && AVX
> +// - VAES && AVX2
> +// - VAES && (AVX10/256 || (AVX512BW && AVX512VL)) && BMI2
> +// - VAES && (AVX10/512 || (AVX512BW && AVX512VL)) && BMI2
> +//
> +// See the function definitions at the bottom of the file for more information.
> +
> +#include <linux/linkage.h>
> +#include <linux/cfi_types.h>
> +
> +.section .rodata
> +.p2align 4
> +
> +.Lbswap_mask:
> + .octa 0x000102030405060708090a0b0c0d0e0f
> +
> +.Lctr_pattern:
> + .quad 0, 0
> +.Lone:
> + .quad 1, 0
> +.Ltwo:
> + .quad 2, 0
> + .quad 3, 0
> +
> +.Lfour:
> + .quad 4, 0
> +
> +.text
> +
> +// Move a vector between memory and a register.
> +// The register operand must be in the first 16 vector registers.
> +.macro _vmovdqu src, dst
> +.if VL < 64
> + vmovdqu \src, \dst
> +.else
> + vmovdqu8 \src, \dst
> +.endif
> +.endm
> +
> +// Move a vector between registers.
> +// The registers must be in the first 16 vector registers.
> +.macro _vmovdqa src, dst
> +.if VL < 64
> + vmovdqa \src, \dst
> +.else
> + vmovdqa64 \src, \dst
> +.endif
> +.endm
> +
> +// Broadcast a 128-bit value from memory to all 128-bit lanes of a vector
> +// register. The register operand must be in the first 16 vector registers.
> +.macro _vbroadcast128 src, dst
> +.if VL == 16
> + vmovdqu \src, \dst
> +.elseif VL == 32
> + vbroadcasti128 \src, \dst
> +.else
> + vbroadcasti32x4 \src, \dst
> +.endif
> +.endm
> +
> +// XOR two vectors together.
> +// Any register operands must be in the first 16 vector registers.
> +.macro _vpxor src1, src2, dst
> +.if VL < 64
> + vpxor \src1, \src2, \dst
> +.else
> + vpxord \src1, \src2, \dst
> +.endif
> +.endm
> +
> +// Load 1 <= %ecx <= 15 bytes from the pointer \src into the xmm register \dst
> +// and zeroize any remaining bytes. Clobbers %rax, %rcx, and \tmp{64,32}.
> +.macro _load_partial_block src, dst, tmp64, tmp32
> + sub $8, %ecx // LEN - 8
> + jle .Lle8\@
> +
> + // Load 9 <= LEN <= 15 bytes.
> + vmovq (\src), \dst // Load first 8 bytes
> + mov (\src, %rcx), %rax // Load last 8 bytes
> + neg %ecx
> + shl $3, %ecx
> + shr %cl, %rax // Discard overlapping bytes
> + vpinsrq $1, %rax, \dst, \dst
> + jmp .Ldone\@
> +
> +.Lle8\@:
> + add $4, %ecx // LEN - 4
> + jl .Llt4\@
> +
> + // Load 4 <= LEN <= 8 bytes.
> + mov (\src), %eax // Load first 4 bytes
> + mov (\src, %rcx), \tmp32 // Load last 4 bytes
> + jmp .Lcombine\@
> +
> +.Llt4\@:
> + // Load 1 <= LEN <= 3 bytes.
> + add $2, %ecx // LEN - 2
> + movzbl (\src), %eax // Load first byte
> + jl .Lmovq\@
> + movzwl (\src, %rcx), \tmp32 // Load last 2 bytes
> +.Lcombine\@:
> + shl $3, %ecx
> + shl %cl, \tmp64
> + or \tmp64, %rax // Combine the two parts
> +.Lmovq\@:
> + vmovq %rax, \dst
> +.Ldone\@:
> +.endm
> +
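
(An aside for other readers, not a problem with the patch: the
9 <= LEN <= 15 case above does two overlapping 8-byte loads and shifts
the second one right to discard the bytes already covered by the first.
A rough little-endian C model of that path, untested, with a made-up
function name:)

  /* Hypothetical C model of the 9 <= len <= 15 path of
   * _load_partial_block.  Assumes a little-endian CPU (x86).
   */
  #include <stdint.h>
  #include <string.h>

  static void load_partial_9_15(const uint8_t *src, size_t len,
                                uint8_t dst[16])
  {
          uint64_t lo, hi;

          memcpy(&lo, src, 8);            /* first 8 bytes */
          memcpy(&hi, src + len - 8, 8);  /* last 8 bytes, overlapping */
          hi >>= 8 * (16 - len);          /* discard overlapping bytes */
          memcpy(dst, &lo, 8);
          memcpy(dst + 8, &hi, 8);        /* dst[len..15] are zero */
  }
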
> +// Store 1 <= %ecx <= 15 bytes from the xmm register \src to the pointer \dst.
> +// Clobbers %rax, %rcx, and \tmp{64,32}.
> +.macro _store_partial_block src, dst, tmp64, tmp32
> + sub $8, %ecx // LEN - 8
> + jl .Llt8\@
> +
> + // Store 8 <= LEN <= 15 bytes.
> + vpextrq $1, \src, %rax
> + mov %ecx, \tmp32
> + shl $3, %ecx
> + ror %cl, %rax
> + mov %rax, (\dst, \tmp64) // Store last LEN - 8 bytes
> + vmovq \src, (\dst) // Store first 8 bytes
> + jmp .Ldone\@
> +
> +.Llt8\@:
> + add $4, %ecx // LEN - 4
> + jl .Llt4\@
> +
> + // Store 4 <= LEN <= 7 bytes.
> + vpextrd $1, \src, %eax
> + mov %ecx, \tmp32
> + shl $3, %ecx
> + ror %cl, %eax
> + mov %eax, (\dst, \tmp64) // Store last LEN - 4 bytes
> + vmovd \src, (\dst) // Store first 4 bytes
> + jmp .Ldone\@
> +
> +.Llt4\@:
> + // Store 1 <= LEN <= 3 bytes.
> + vpextrb $0, \src, 0(\dst)
> + cmp $-2, %ecx // LEN - 4 == -2, i.e. LEN == 2?
> + jl .Ldone\@
> + vpextrb $1, \src, 1(\dst)
> + je .Ldone\@
> + vpextrb $2, \src, 2(\dst)
> +.Ldone\@:
> +.endm
> +
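
(And the mirror trick in the 8 <= LEN <= 15 store path above: the high
half is rotated so that its useful bytes land at dst[8..LEN-1] when
stored at dst + LEN - 8, and the low bytes clobbered by that
overlapping store are then fixed up by the plain 8-byte store of the
low half.  Again a rough little-endian C model, untested, with a
made-up function name:)

  /* Hypothetical C model of the 8 <= len <= 15 path of
   * _store_partial_block.  Assumes a little-endian CPU (x86).
   */
  #include <stdint.h>
  #include <string.h>

  static void store_partial_8_15(const uint8_t src[16], size_t len,
                                 uint8_t *dst)
  {
          uint64_t lo, hi;
          unsigned int rot = 8 * (len - 8);  /* bits to rotate right */

          memcpy(&lo, src, 8);
          memcpy(&hi, src + 8, 8);
          if (rot)
                  hi = (hi >> rot) | (hi << (64 - rot));
          memcpy(dst + len - 8, &hi, 8);  /* last len-8 bytes + overlap */
          memcpy(dst, &lo, 8);            /* first 8 bytes fix overlap */
  }
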
> +// Prepare the next two vectors of AES inputs in AESDATA\i0 and AESDATA\i1, and
> +// XOR each with the zero-th round key. Also update LE_CTR if !\final.
> +.macro _prepare_2_ctr_vecs is_xctr, i0, i1, final=0
> +.if \is_xctr
> + .if USE_AVX10
> + _vmovdqa LE_CTR, AESDATA\i0
> + vpternlogd $0x96, XCTR_IV, RNDKEY0, AESDATA\i0
> + .else
> + vpxor XCTR_IV, LE_CTR, AESDATA\i0
> + vpxor RNDKEY0, AESDATA\i0, AESDATA\i0
> + .endif
> + vpaddq LE_CTR_INC1, LE_CTR, AESDATA\i1
> +
> + .if USE_AVX10
> + vpternlogd $0x96, XCTR_IV, RNDKEY0, AESDATA\i1
> + .else
> + vpxor XCTR_IV, AESDATA\i1, AESDATA\i1
> + vpxor RNDKEY0, AESDATA\i1, AESDATA\i1
> + .endif
> +.else
> + vpshufb BSWAP_MASK, LE_CTR, AESDATA\i0
> + _vpxor RNDKEY0, AESDATA\i0, AESDATA\i0
> + vpaddq LE_CTR_INC1, LE_CTR, AESDATA\i1
> + vpshufb BSWAP_MASK, AESDATA\i1, AESDATA\i1
> + _vpxor RNDKEY0, AESDATA\i1, AESDATA\i1
> +.endif
> +.if !\final
> + vpaddq LE_CTR_INC2, LE_CTR, LE_CTR
> +.endif
> +.endm
> +
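
(Another aside: vpternlogd with immediate 0x96 is the three-way XOR,
which is how the AVX512/AVX10 path above folds the two vpxor
instructions of the other path into a single instruction; 0x96 is just
the truth table of a ^ b ^ c.  A trivial untested check:)

  /* Bit (a<<2 | b<<1 | c) of a ternary-logic immediate is the result
   * for inputs (a, b, c); verify that 0x96 encodes a ^ b ^ c.
   */
  #include <assert.h>

  int main(void)
  {
          for (int i = 0; i < 8; i++) {
                  int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;

                  assert(((0x96 >> i) & 1) == (a ^ b ^ c));
          }
          return 0;
  }
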
> +// Do all AES rounds on the data in the given AESDATA vectors, excluding the
> +// zero-th and last rounds.
> +.macro _aesenc_loop vecs

If you make this vecs:vararg, you can drop the "" around the arguments
in the callers.

> + mov KEY, %rax
> +1:
> + _vbroadcast128 (%rax), RNDKEY
> +.irp i, \vecs
> + vaesenc RNDKEY, AESDATA\i, AESDATA\i
> +.endr
> + add $16, %rax
> + cmp %rax, RNDKEYLAST_PTR
> + jne 1b
> +.endm
> +
> +// Finalize the keystream blocks in the given AESDATA vectors by doing the last
> +// AES round, then XOR those keystream blocks with the corresponding data.
> +// Reduce latency by doing the XOR before the vaesenclast, utilizing the
> +// property vaesenclast(key, a) ^ b == vaesenclast(key ^ b, a).
> +.macro _aesenclast_and_xor vecs

Same here

> +.irp i, \vecs
> + _vpxor \i*VL(SRC), RNDKEYLAST, RNDKEY
> + vaesenclast RNDKEY, AESDATA\i, AESDATA\i
> +.endr
> +.irp i, \vecs
> + _vmovdqu AESDATA\i, \i*VL(DST)
> +.endr
> +.endm
> +
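
(And for anyone wondering why the identity in the comment above holds:
the last AES round is ShiftRows and SubBytes followed by a plain XOR
with the round key, so XORing a value into the round key is equivalent
to XORing it into the output.  A quick untested user-space check with
AES-NI intrinsics, built with something like gcc -O2 -maes; note that
the intrinsic takes (state, round_key), the reverse of the AT&T operand
order used in the comment:)

  #include <immintrin.h>
  #include <assert.h>
  #include <string.h>

  int main(void)
  {
          /* Arbitrary test values. */
          __m128i a   = _mm_set1_epi32(0x01234567);
          __m128i key = _mm_set1_epi32(0x76543210);
          __m128i b   = _mm_set1_epi32(0x0badf00d);

          /* vaesenclast(key, a) ^ b ... */
          __m128i lhs = _mm_xor_si128(_mm_aesenclast_si128(a, key), b);
          /* ... == vaesenclast(key ^ b, a) */
          __m128i rhs = _mm_aesenclast_si128(a, _mm_xor_si128(key, b));

          assert(memcmp(&lhs, &rhs, sizeof(lhs)) == 0);
          return 0;
  }
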
...