Message-ID: <v36sousjd5ukqlkpdxslvpu7l37zbu7d7slgc2trjjqwty2bny@qgzew34feo2r>
Date: Thu, 23 Jan 2025 19:32:57 -0500
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Eric Biggers <ebiggers@...nel.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org, Ard Biesheuvel <ardb@...nel.org>,
Chao Yu <chao@...nel.org>, "Darrick J. Wong" <djwong@...nel.org>,
Geert Uytterhoeven <geert@...ux-m68k.org>, "Martin K. Petersen" <martin.petersen@...cle.com>,
Michael Ellerman <mpe@...erman.id.au>, Theodore Ts'o <tytso@....edu>,
Vinicius Peixoto <vpeixoto@...amp.dev>, WangYuli <wangyuli@...ontech.com>
Subject: Re: [GIT PULL] CRC updates for 6.14

On Thu, Jan 23, 2025 at 03:42:09PM -0800, Eric Biggers wrote:
> On Thu, Jan 23, 2025 at 05:36:40PM -0500, Kent Overstreet wrote:
> > On Wed, Jan 22, 2025 at 11:46:18PM -0800, Eric Biggers wrote:
> > > On Wed, Jan 22, 2025 at 09:16:33PM -0800, Eric Biggers wrote:
> > > > As you probably noticed, the other problem is that CRC32 has 4 generic
> > > > implementations: bit-by-bit, and slice by 1, 4, or 8 bytes.
> > > >
> > > > Bit-by-bit is useless. Slice by 4 and slice by 8 are too similar to have both.
> > > >
> > > > It's not straightforward to choose between slice by 1 and slice by 4/8, though.
> > > > When benchmarking slice-by-n, a higher n will always be faster in
> > > > microbenchmarks (up to about n=16), but the required table size also increases
> > > > accordingly. E.g., a slice-by-1 CRC32 uses a 1024-byte table, while slice-by-8
> > > > uses an 8192-byte table. This table is accessed randomly, which is really bad on
> > > > the dcache, and can be really bad for performance in real world scenarios where
> > > > the system is bottlenecked on memory.
> > > >
> > > > I'm tentatively planning to just say that slice-by-4 is a good enough compromise
> > > > and have that be the only generic CRC32 implementation.
> > > >
> > > > But I need to try an interleaved implementation too, since it's possible that
> > > > could give the best of both worlds.
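
For reference, a slice-by-1 (byte-at-a-time) table-driven CRC32 looks
roughly like the following minimal userspace sketch (illustrative names,
not the kernel's lib/crc32.c). The 256-entry u32 table is the 1024 bytes
mentioned above; slice-by-n keeps n such tables (4 KiB for slice-by-4,
8 KiB for slice-by-8) and folds n input bytes per loop iteration, which
is where both the microbenchmark win and the dcache footprint come from.

#include <stddef.h>
#include <stdint.h>

/* Reflected CRC32 as used by crc32_le, polynomial 0xedb88320 (sketch only). */
static uint32_t crc32_table[256];		/* 256 * 4 = 1024 bytes */

static void crc32_init_table(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;

		for (int j = 0; j < 8; j++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xedb88320u : crc >> 1;
		crc32_table[i] = crc;
	}
}

static uint32_t crc32_le_slice1(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--)
		crc = crc32_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
	return crc;
}

(The conventional CRC-32 value comes from seeding with 0xffffffff and
inverting the result; that pre/post conditioning is left to the caller
here, as it is for crc32_le().)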
> > >
> > > Actually, I'm tempted to just provide slice-by-1 (a.k.a. byte-by-byte) as the
> > > only generic CRC32 implementation. The generic code has become increasingly
> > > irrelevant due to the arch-optimized code existing. The arch-optimized code
> > > tends to be 10 to 100 times faster on long messages.
> > >
> > > The generic CRC32 code is still needed when the CPU features needed by the arch
> > > code are unavailable. But that's rare these days. It's also still needed when
> > > the CPU has no scalar instructions to accelerate the CRC (e.g. on x86_64, the
> > > "regular" CRC32 as opposed to the Castagnoli CRC32) *and* the message is too
> > > short for the overhead of saving and restoring the vector registers to be worth
> > > it -- typically < 64 bytes or so. And it's still needed when the CRC is done in
> > > a context where vector registers can't be used at all.
> > >
> > > But those don't feel like very strong motivations for the huge tables anymore.
> > > I think the huge tables were really intended for optimizing CRCs of long
> > > messages back when CPUs didn't have any better way to do it.
> >
> > Have you by chance been looking at performance of 64 bit crc algorithms,
> > or have any recommendations there?
> >
> > I've been starting to consider switching to a 64 bit crc for the
> > default for bcachefs - and we do want it to be a regular crc so we can
> > combine/add them together, not one of the newer fancier add/xor based
> > hashes.
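
The combine/add property is CRC linearity over GF(2): for a raw CRC
(zero initial value, no final inversion), the CRC of a concatenation can
be computed from the two pieces' CRCs and the length of the second piece
alone, without re-reading the data. A naive sketch, reusing the
illustrative crc32_table from the slice-by-1 example above; a real
combine function (zlib's crc32_combine(), for instance) does the
zero-extension in logarithmic time with GF(2) polynomial arithmetic and
also accounts for the usual 0xffffffff pre/post conditioning:

/*
 * Illustration only, assuming a raw CRC (init 0, no final XOR):
 * crc(A || B) == crc32_extend_zeros(crc(A), len_b) ^ crc(B).
 * Feeding zero bytes into a zero state leaves it at zero, so
 * crc(0...0 || B) == crc(B), and the CRC of an XOR of equal-length
 * messages is the XOR of their CRCs.
 */
static uint32_t crc32_extend_zeros(uint32_t crc, size_t nzeros)
{
	while (nzeros--)
		crc = crc32_table[crc & 0xff] ^ (crc >> 8);	/* one 0x00 byte */
	return crc;
}

static uint32_t crc32_combine_naive(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
	return crc32_extend_zeros(crc_a, len_b) ^ crc_b;
}

The same identity works for any CRC width, including a 64-bit one, and
it is exactly the structure that the add/xor based hashes lack.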
> >
> > I thought we'd gotten a faster PCLMULQDQ based implementation of a
> > crc64, but looking again it appears not, hrm.
>
> Yep! I've written an assembly template that expands into a PCLMULQDQ or
> VPCLMULQDQ implementation of any variant of CRC-8, CRC-16, CRC-32, or CRC-64.
> The latest work-in-progress version can be found at
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=crc-x86.
> We just need to decide which CRC variants to wire it up for. My tentative plan,
> which is implemented in that git branch, is crc32_le, crc_t10dif, crc64_be, and
> crc64_nvme. (crc64_nvme is currently called crc64_rocksoft, but that name makes
> no sense.) For crc32_le and crc_t10dif these would replace the existing
> PCLMULQDQ implementations. crc64* on the other hand would gain acceleration for
> the first time, improving performance for those by as much as 100x. I'm also
> fixing the CRC64 lib to be organized the way I did CRC32 and CRC-T10DIF.
>
> That work is targeting 6.15, since there was already enough for 6.14.
>
> Though this approach makes it easy to wire up arbitrary CRC variants for x86, we
> do need to make sure that anyone we wire up is actually useful. The users of
> the optimized crc64_be would be drivers/md/bcache/ and fs/bcachefs/. So if you
> could confirm that that would indeed be useful for you (and in particular you
> haven't deprecated the CRC64 support in favor of xxHash), that would be helpful.

That's fantastic!

CRC64 isn't deprecated, and I suspect that will be our preferred option
at some point in the future. xxHash was added only because there is a
real desire for 64 bit checksums and crc64 wasn't a good option at the
time.

(It was also designed as a hash function, not for data protection, and
I'd want to see some actual literature evaluating it for data integrity
before I'd consider making it the default. And like I mentioned, CRCs
have other properties we want).