linux-kernel - Re: [PATCH 2/2] arm64/crc32: Implement 4-way interleave using PMULL

Message-ID: <CAMj1kXHDqD29TzE=2cw55qeKrnybgkYFCdy4jU_4E=OaUOkZNg@mail.gmail.com>
Date: Wed, 16 Oct 2024 09:12:41 +0200
From: Ard Biesheuvel <ardb@...nel.org>
To: Eric Biggers <ebiggers@...nel.org>
Cc: Ard Biesheuvel <ardb+git@...gle.com>, linux-arm-kernel@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org, 
	herbert@...dor.apana.org.au, will@...nel.org, catalin.marinas@....com, 
	Kees Cook <kees@...nel.org>
Subject: Re: [PATCH 2/2] arm64/crc32: Implement 4-way interleave using PMULL

On Wed, 16 Oct 2024 at 05:03, Eric Biggers <ebiggers@...nel.org> wrote:
>
> On Tue, Oct 15, 2024 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@...nel.org>
> >
> > Now that kernel mode NEON no longer disables preemption, using FP/SIMD
> > in library code which is not obviously part of the crypto subsystem is
> > no longer problematic, as it will no longer incur unexpected latencies.
> >
> > So accelerate the CRC-32 library code on arm64 to use a 4-way
> > interleave, using PMULL instructions to implement the folding.
> >
> > On Apple M2, this results in a speedup of 2 - 2.8x when using input
> > sizes of 1k - 8k. For smaller sizes, the overhead of preserving and
> > restoring the FP/SIMD register file may not be worth it, so 1k is used
> > as a threshold for choosing this code path.
> >
> > The coefficient tables were generated using code provided by Eric. [0]
> >
> > [0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c
> >
> > Cc: Eric Biggers <ebiggers@...nel.org>
> > Signed-off-by: Ard Biesheuvel <ardb@...nel.org>
> > ---
> >  arch/arm64/lib/Makefile      |   2 +-
> >  arch/arm64/lib/crc32-glue.c  |  36 +++
> >  arch/arm64/lib/crc32-pmull.S | 240 ++++++++++++++++++++
> >  3 files changed, 277 insertions(+), 1 deletion(-)
>
> Thanks for doing this!  The new code looks good to me.  4-way does seem like the
> right choice for arm64.
>

Agreed.

> I'd recommend calling the file crc32-4way.S and the functions
> crc32*_arm64_4way(), rather than crc32-pmull.S and crc32*_pmull().  This would
> avoid confusion with a CRC implementation that is actually based entirely on
> pmull (which is possible).

I'm well aware :-)

commit 8fefde90e90c9f5c2770e46ceb127813d3f20c34
Author: Ard Biesheuvel <ardb@...nel.org>
Date:   Mon Dec 5 18:42:27 2016 +0000

    crypto: arm64/crc32 - accelerated support based on x86 SSE implementation

commit 598b7d41e544322c8c4f3737ee8ddf905a44175e
Author: Ard Biesheuvel <ardb@...nel.org>
Date:   Mon Aug 27 13:02:45 2018 +0200

    crypto: arm64/crc32 - remove PMULL based CRC32 driver

I removed it because it wasn't actually faster, although that might be
different on modern cores.

>  The proposed implementation uses the crc32
> instructions to do most of the work and only uses pmull for combining the CRCs.
> Yes, crc32c-pcl-intel-asm_64.S made this same mistake, but it is a mistake, IMO.
>

Yeah, good point.
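
For illustration only, the distinction drawn above (the CRC-32 step does the
bulk of the work, and a carryless multiply is used only to combine the
per-chunk CRCs) can be sketched in portable userspace C. This is not the code
from the patch: the function names are hypothetical, the bitwise loop stands
in for the arm64 crc32 instructions, gf2_mulmod() stands in for PMULL, and the
four chunks are processed one after another rather than interleaved. The 1k
threshold is the one quoted in the commit message. The sketch uses the raw
update convention (no pre- or post-inversion of the CRC).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE 0xEDB88320u  /* reflected CRC-32 polynomial */
#define MIN_LEN_4WAY  1024         /* threshold from the commit message */

/* Bitwise CRC-32 update; stands in for the arm64 crc32 instructions. */
static uint32_t crc32_le_scalar(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc & 1) ? (crc >> 1) ^ CRC32_POLY_LE : crc >> 1;
	}
	return crc;
}

/* a * b mod P over GF(2), bit 31 == x^0; stands in for PMULL + reduction. */
static uint32_t gf2_mulmod(uint32_t a, uint32_t b)
{
	uint32_t prod = 0;

	for (uint32_t m = 0x80000000u; m; m >>= 1) {
		if (a & m)
			prod ^= b;
		b = (b & 1) ? (b >> 1) ^ CRC32_POLY_LE : b >> 1; /* b *= x */
	}
	return prod;
}

/* x^nbits mod P by square-and-multiply. */
static uint32_t xpow_mod_p(uint64_t nbits)
{
	uint32_t result = 0x80000000u; /* x^0 */
	uint32_t base = 0x40000000u;   /* x^1 */

	while (nbits) {
		if (nbits & 1)
			result = gf2_mulmod(result, base);
		base = gf2_mulmod(base, base);
		nbits >>= 1;
	}
	return result;
}

/* Four independent CRCs over equal chunks, then combine by multiplication. */
static uint32_t crc32_le_4way_sketch(uint32_t crc, const uint8_t *p, size_t len)
{
	size_t chunk = len / 4;
	uint32_t shift = xpow_mod_p((uint64_t)chunk * 8);
	uint32_t c0, c1, c2, c3, acc;

	assert(4 * chunk <= len);

	/* Only the first chunk carries the incoming CRC; the rest start at 0. */
	c0 = crc32_le_scalar(crc, p, chunk);
	c1 = crc32_le_scalar(0, p + chunk, chunk);
	c2 = crc32_le_scalar(0, p + 2 * chunk, chunk);
	c3 = crc32_le_scalar(0, p + 3 * chunk, chunk);

	/* Advance the running CRC past each following chunk and fold it in. */
	acc = gf2_mulmod(c0, shift) ^ c1;
	acc = gf2_mulmod(acc, shift) ^ c2;
	acc = gf2_mulmod(acc, shift) ^ c3;

	/* Handle the up to three leftover bytes with the scalar path. */
	return crc32_le_scalar(acc, p + 4 * chunk, len - 4 * chunk);
}

/* Dispatch on length; a kernel version would also check SIMD usability. */
static uint32_t crc32_le_sketch(uint32_t crc, const uint8_t *p, size_t len)
{
	if (len >= MIN_LEN_4WAY)
		return crc32_le_4way_sketch(crc, p, len);
	return crc32_le_scalar(crc, p, len);
}
```

Combining works because the raw CRC update is linear over GF(2): the CRC of
A||B equals the CRC of A multiplied by x^(8 * len(B)) mod P, XORed with the
CRC of B computed from a zero seed. An implementation based entirely on pmull
would instead fold the data itself with carryless multiplies and would not use
the crc32 instructions at all.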