Message-ID: <20250123181818.GA2117666@google.com>
Date: Thu, 23 Jan 2025 18:18:18 +0000
From: Eric Biggers <ebiggers@...nel.org>
To: Theodore Ts'o <tytso@....edu>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
Ard Biesheuvel <ardb@...nel.org>, Chao Yu <chao@...nel.org>,
"Darrick J. Wong" <djwong@...nel.org>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
Kent Overstreet <kent.overstreet@...ux.dev>,
"Martin K. Petersen" <martin.petersen@...cle.com>,
Michael Ellerman <mpe@...erman.id.au>,
Vinicius Peixoto <vpeixoto@...amp.dev>,
WangYuli <wangyuli@...sls0nwwnnilyahiblcmlmlcaoki5s.yundunwaf1.com>
Subject: Re: [GIT PULL] CRC updates for 6.14
On Thu, Jan 23, 2025 at 09:07:44AM -0500, Theodore Ts'o wrote:
> On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> >
> > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > only generic CRC32 implementation. The generic code has become increasingly
> > irrelevant due to the arch-optimized code existing. The arch-optimized code
> > tends to be 10 to 100 times faster on long messages.
>
> Yeah, that's my intuition as well; I would think the CPUs that
> don't have CRC32 optimization instruction(s) would probably be the
> most sensitive to dcache thrashing.
>
> But given that Geert ran into this on m68k (I assume), maybe we could
> have him benchmark the various crc32 generic implementations to see
> which is the best for him? That is, assuming that he cares (which he
> might not. :-).

FWIW, benchmarking the CRC library functions is easy now; just enable
CONFIG_CRC_KUNIT_TEST=y and CONFIG_CRC_BENCHMARK=y.

But it's just a traditional benchmark that calls the functions in a loop, and it
doesn't account for dcache thrashing. As I mentioned, that sort of benchmark
doesn't tell the whole story about the drawbacks of using a huge table. Focusing
only on microbenchmarks of slice-by-n generally makes some value n > 1 seem
optimal, potentially as high as n=16 depending on the CPU, though really old
CPUs like m68k should need much less. So the rationale for choosing "slice-by-1"
in the kernel would be that the reduced dcache use and code size, and the fact
that arch-optimized code is usually used instead these days anyway, matter more
than microbenchmark results. (Also, the other CRC variants in the kernel like
CRC64, CRC-T10DIF, CRC16, etc. already have only slice-by-1, so this would make
CRC32 consistent with them.)
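
For anyone following along who hasn't looked at this code, here is a minimal
userspace sketch of what "slice-by-1" means. It is not the kernel's
implementation; the names crc32_le_slice1() and crc32_ieee() are made up for
illustration, and it assumes the common reflected CRC-32 polynomial 0xEDB88320.
The point is the footprint: one 256-entry table of 32-bit words (1 KiB) and one
lookup per input byte, whereas a typical slice-by-n implementation carries n
such tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the reflected (LSB-first) CRC-32 polynomial. */
#define CRC32_POLY_REFLECTED 0xEDB88320u

/* The only table slice-by-1 needs: 256 entries * 4 bytes = 1 KiB. */
static uint32_t crc32_table[256];
static int crc32_table_ready;

/* Build the table by running each byte value through 8 bit-at-a-time steps. */
static void crc32_init_table(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;

		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_REFLECTED : 0);
		crc32_table[i] = crc;
	}
	crc32_table_ready = 1;
}

/*
 * Hypothetical helper: raw CRC update, one table lookup per byte.
 * No initial or final inversion is applied here, so calls can be chained.
 */
uint32_t crc32_le_slice1(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(p != NULL || len == 0);
	if (!crc32_table_ready)
		crc32_init_table();
	while (len--)
		crc = (crc >> 8) ^ crc32_table[(crc ^ *p++) & 0xFF];
	return crc;
}

/* Hypothetical wrapper: conventional CRC-32 with pre- and post-inversion. */
uint32_t crc32_ieee(const void *data, size_t len)
{
	return crc32_le_slice1(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With the inversion wrapper, crc32_ieee("123456789", 9) gives the standard
CRC-32 check value 0xCBF43926. A loop like this does fine in a
microbenchmark's hot cache too; the difference is that it evicts far less of
everyone else's data when it runs in between other work.
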
- Eric