Message-ID: <20250305142653.751d9840@pumpkin>
Date: Wed, 5 Mar 2025 14:26:53 +0000
From: David Laight <david.laight.linux@...il.com>
To: Eric Biggers <ebiggers@...nel.org>
Cc: linux-kernel@...r.kernel.org, Bill Wendling <morbo@...gle.com>, Thomas
Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav
Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>, Ard Biesheuvel
<ardb@...nel.org>, Nathan Chancellor <nathan@...nel.org>, Nick Desaulniers
<nick.desaulniers+lkml@...il.com>, Justin Stitt <justinstitt@...gle.com>,
linux-crypto@...r.kernel.org, llvm@...ts.linux.dev
Subject: Re: [PATCH] x86/crc32: optimize tail handling for crc32c short
inputs
On Tue, 4 Mar 2025 13:32:16 -0800
Eric Biggers <ebiggers@...nel.org> wrote:
> From: Eric Biggers <ebiggers@...gle.com>
>
> For handling the 0 <= len < sizeof(unsigned long) bytes left at the end,
> do a 4-2-1 step-down instead of a byte-at-a-time loop. This allows
> taking advantage of wider CRC instructions. Note that crc32c-3way.S
> already uses this same optimization too.
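
For reference, I read that 4-2-1 step-down as something like the
untested sketch below (not the actual patch code), with the 0..7 tail
bytes consumed by at most one crc32l, one crc32w and one crc32b
(the unaligned loads are fine on x86):

	static inline u32 crc32c_tail(u32 crc, const u8 *p, size_t len)
	{
		/* 0 <= len < 8: consume 4, then 2, then 1 bytes. */
		if (len & 4) {
			asm("crc32l %1, %0" : "+r" (crc) : "rm" (*(const u32 *)p));
			p += 4;
		}
		if (len & 2) {
			asm("crc32w %1, %0" : "+r" (crc) : "rm" (*(const u16 *)p));
			p += 2;
		}
		if (len & 1)
			asm("crc32b %1, %0" : "+r" (crc) : "rm" (*(const u8 *)p));
		return crc;
	}
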
An alternative is to add enough zero bytes at the start of the buffer
to round the length up to a multiple of 8.
They don't affect the crc and only require the first 8 bytes to be
shifted left.
I think any non-zero 'crc-in' just needs to be xor'ed over the first
4 actual data bytes.
(It's over 40 years since I did the maths of CRC.)
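
Very roughly, and untested (only for len >= 8 and a pad of at most 4
bytes, so the crc-in doesn't straddle the first word boundary;
get_unaligned_le64() is the usual kernel helper):

	static inline u32 crc32c_u64(u32 crc, u64 data)
	{
		u64 c = crc;

		asm("crc32q %1, %0" : "+r" (c) : "rm" (data));
		return (u32)c;
	}

	static u32 crc32c_zero_prefix(u32 crc, const u8 *p, size_t len)
	{
		/* Zero bytes conceptually prepended to round len up to 8n. */
		size_t pad = -len & 7;
		u64 first;

		/* First padded word: 'pad' zeros then the first 8 - pad data bytes. */
		first = get_unaligned_le64(p) << (8 * pad);
		/* Fold the crc-in over the first 4 actual data bytes (if my
		 * memory of the algebra is right). */
		first ^= (u64)crc << (8 * pad);
		crc = crc32c_u64(0, first);

		/* The rest of the buffer as (misaligned) 8-byte words. */
		for (p += 8 - pad, len -= 8 - pad; len; p += 8, len -= 8)
			crc = crc32c_u64(crc, get_unaligned_le64(p));

		return crc;
	}
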
You won't notice the misaligned accesses all down the buffer.
When I was testing different ipcsum code, misaligned buffers
cost less than 1 clock per cache line.
I think that was even true for the versions that managed 12 bytes
per clock (including the one Linus committed).
David