Message-ID: <20250305142653.751d9840@pumpkin>
Date: Wed, 5 Mar 2025 14:26:53 +0000
From: David Laight <david.laight.linux@...il.com>
To: Eric Biggers <ebiggers@...nel.org>
Cc: linux-kernel@...r.kernel.org, Bill Wendling <morbo@...gle.com>, Thomas
Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav
Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>, Ard Biesheuvel
<ardb@...nel.org>, Nathan Chancellor <nathan@...nel.org>, Nick Desaulniers
<nick.desaulniers+lkml@...il.com>, Justin Stitt <justinstitt@...gle.com>,
linux-crypto@...r.kernel.org, llvm@...ts.linux.dev
Subject: Re: [PATCH] x86/crc32: optimize tail handling for crc32c short
inputs
On Tue, 4 Mar 2025 13:32:16 -0800
Eric Biggers <ebiggers@...nel.org> wrote:
> From: Eric Biggers <ebiggers@...gle.com>
>
> For handling the 0 <= len < sizeof(unsigned long) bytes left at the end,
> do a 4-2-1 step-down instead of a byte-at-a-time loop. This allows
> taking advantage of wider CRC instructions. Note that crc32c-3way.S
> already uses this same optimization too.
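
For reference, I read that 4-2-1 step-down as something like the
untested sketch below (not the actual patch code), with the 0..7 tail
bytes consumed by at most one crc32l, one crc32w and one crc32b
(the unaligned loads are fine on x86):

	static inline u32 crc32c_tail(u32 crc, const u8 *p, size_t len)
	{
		/* 0 <= len < 8: consume 4, then 2, then 1 bytes. */
		if (len & 4) {
			asm("crc32l %1, %0" : "+r" (crc) : "rm" (*(const u32 *)p));
			p += 4;
		}
		if (len & 2) {
			asm("crc32w %1, %0" : "+r" (crc) : "rm" (*(const u16 *)p));
			p += 2;
		}
		if (len & 1)
			asm("crc32b %1, %0" : "+r" (crc) : "rm" (*(const u8 *)p));
		return crc;
	}
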
An alternative is to add enough zero bytes at the start of the buffer
to round the length up to a multiple of 8.
They don't affect the crc and only require the first 8 bytes to be
shifted left.
I think any non-zero 'crc-in' just needs to be xor'ed over the first
4 actual data bytes.
(It's over 40 years since I did the maths of CRC.)
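
Very roughly, and untested (only for len >= 8 and a pad of at most 4
bytes, so the crc-in doesn't straddle the first word boundary;
get_unaligned_le64() is the usual kernel helper):

	static inline u32 crc32c_u64(u32 crc, u64 data)
	{
		u64 c = crc;

		asm("crc32q %1, %0" : "+r" (c) : "rm" (data));
		return (u32)c;
	}

	static u32 crc32c_zero_prefix(u32 crc, const u8 *p, size_t len)
	{
		/* Zero bytes conceptually prepended to round len up to 8n. */
		size_t pad = -len & 7;
		u64 first;

		/* First padded word: 'pad' zeros then the first 8 - pad data bytes. */
		first = get_unaligned_le64(p) << (8 * pad);
		/* Fold the crc-in over the first 4 actual data bytes (if my
		 * memory of the algebra is right). */
		first ^= (u64)crc << (8 * pad);
		crc = crc32c_u64(0, first);

		/* The rest of the buffer as (misaligned) 8-byte words. */
		for (p += 8 - pad, len -= 8 - pad; len; p += 8, len -= 8)
			crc = crc32c_u64(crc, get_unaligned_le64(p));

		return crc;
	}
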
You won't notice the misaligned accesses all down the buffer.
When I was testing different ipcsum code, misaligned buffers
cost less than 1 clock per cache line.
I think that was even true for the versions that managed 12 bytes
per clock (including the one Linus committed).
David