Message-ID: <CAMj1kXHDqD29TzE=2cw55qeKrnybgkYFCdy4jU_4E=OaUOkZNg@mail.gmail.com>
Date: Wed, 16 Oct 2024 09:12:41 +0200
From: Ard Biesheuvel <ardb@...nel.org>
To: Eric Biggers <ebiggers@...nel.org>
Cc: Ard Biesheuvel <ardb+git@...gle.com>, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org,
herbert@...dor.apana.org.au, will@...nel.org, catalin.marinas@....com,
Kees Cook <kees@...nel.org>
Subject: Re: [PATCH 2/2] arm64/crc32: Implement 4-way interleave using PMULL

On Wed, 16 Oct 2024 at 05:03, Eric Biggers <ebiggers@...nel.org> wrote:
>
> On Tue, Oct 15, 2024 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@...nel.org>
> >
> > Now that kernel mode NEON no longer disables preemption, using FP/SIMD
> > in library code which is not obviously part of the crypto subsystem is
> > no longer problematic, as it will no longer incur unexpected latencies.
> >
> > So accelerate the CRC-32 library code on arm64 to use a 4-way
> > interleave, using PMULL instructions to implement the folding.
> >
> > On Apple M2, this results in a speedup of 2 - 2.8x when using input
> > sizes of 1k - 8k. For smaller sizes, the overhead of preserving and
> > restoring the FP/SIMD register file may not be worth it, so 1k is used
> > as a threshold for choosing this code path.
> >
> > The coefficient tables were generated using code provided by Eric. [0]
> >
> > [0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c
> >
> > Cc: Eric Biggers <ebiggers@...nel.org>
> > Signed-off-by: Ard Biesheuvel <ardb@...nel.org>
> > ---
> > arch/arm64/lib/Makefile | 2 +-
> > arch/arm64/lib/crc32-glue.c | 36 +++
> > arch/arm64/lib/crc32-pmull.S | 240 ++++++++++++++++++++
> > 3 files changed, 277 insertions(+), 1 deletion(-)
>
> Thanks for doing this! The new code looks good to me. 4-way does seem like the
> right choice for arm64.
>
Agreed.

> I'd recommend calling the file crc32-4way.S and the functions
> crc32*_arm64_4way(), rather than crc32-pmull.S and crc32*_pmull(). This would
> avoid confusion with a CRC implementation that is actually based entirely on
> pmull (which is possible).

I'm well aware :-)

commit 8fefde90e90c9f5c2770e46ceb127813d3f20c34
Author: Ard Biesheuvel <ardb@...nel.org>
Date:   Mon Dec 5 18:42:27 2016 +0000

    crypto: arm64/crc32 - accelerated support based on x86 SSE implementation

commit 598b7d41e544322c8c4f3737ee8ddf905a44175e
Author: Ard Biesheuvel <ardb@...nel.org>
Date:   Mon Aug 27 13:02:45 2018 +0200

    crypto: arm64/crc32 - remove PMULL based CRC32 driver

I removed it because it wasn't actually faster, although that might be
different on modern cores.

> The proposed implementation uses the crc32
> instructions to do most of the work and only uses pmull for combining the CRCs.
> Yes, crc32c-pcl-intel-asm_64.S made this same mistake, but it is a mistake, IMO.
>
Yeah good point.
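
For readers of the archive, the structure under discussion (independent CRC
lanes, with a multiply only at the end to combine them) can be sketched in
portable C. This is an illustration under stated assumptions, not the code
from the patch: the names crc32_update(), crc32_shift(), crc32_4way_sketch()
and FOURWAY_THRESHOLD are hypothetical, the lanes here are four contiguous
quarters of the buffer processed one after another rather than interleaved
instruction streams, and the combining step feeds zero bytes through the CRC
where the real code would use PMULL with precomputed constants. The 1k cutoff
is the one quoted in the patch description.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 update (polynomial 0xEDB88320).
 * No pre- or post-inversion is applied; the caller handles that. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return crc;
}

/* Advance a CRC state past n zero bytes, i.e. multiply it by x^(8n)
 * modulo the CRC polynomial. Slow stand-in for a carryless multiply
 * by a precomputed constant. */
static uint32_t crc32_shift(uint32_t crc, size_t n)
{
	while (n--)
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	return crc;
}

/* Assumption: threshold taken from the patch description (1k). */
#define FOURWAY_THRESHOLD 1024

static uint32_t crc32_4way_sketch(uint32_t crc, const uint8_t *p, size_t len)
{
	if (len < FOURWAY_THRESHOLD)
		return crc32_update(crc, p, len);

	size_t lane = len / 4;        /* bytes per lane */
	size_t tail = len - 4 * lane; /* leftover bytes after the lanes */

	/* Only the first lane carries the incoming CRC; the others start
	 * from zero so that the results can be combined by linearity. */
	uint32_t c0 = crc32_update(crc, p, lane);
	uint32_t c1 = crc32_update(0, p + lane, lane);
	uint32_t c2 = crc32_update(0, p + 2 * lane, lane);
	uint32_t c3 = crc32_update(0, p + 3 * lane, lane);

	/* Shift each lane past the data that follows it, then XOR. */
	uint32_t combined = crc32_shift(c0, 3 * lane) ^
			    crc32_shift(c1, 2 * lane) ^
			    crc32_shift(c2, lane) ^ c3;

	return crc32_update(combined, p + 4 * lane, tail);
}
```

The combining step relies only on the CRC being linear over GF(2) when no
final inversion is applied, so the sketch should return the same value as a
single crc32_update() pass over the whole buffer.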