Message-ID: <20251124224028.GA1827@quark>
Date: Mon, 24 Nov 2025 14:40:28 -0800
From: Eric Biggers <ebiggers@...nel.org>
To: "Jason A. Donenfeld" <Jason@...c4.com>
Cc: david laight <david.laight@...box.com>,
Thorsten Blum <thorsten.blum@...ux.dev>,
Ard Biesheuvel <ardb@...nel.org>, linux-crypto@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] lib/crypto: blake2b: Limit frame size workaround to GCC < 12.2 on i386

On Mon, Nov 24, 2025 at 06:14:31PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 24, 2025 at 10:08 AM david laight <david.laight@...box.com> wrote:
> > > How about we roll up the BLAKE2b rounds loop if !CONFIG_64BIT?
> >
> > I do wonder about the real benefit of some of the massive loop unrolling
> > that happens in a lot of these algorithms (not just blake2b).
>
> I remember looking at this in the context of blake2s, with two paths,
> depending on CONFIG_CC_OPTIMIZE_FOR_SIZE, but the savings didn't seem
> enough for the performance hit. It might be platform specific though.
> I guess try it and post numbers, and that'll either be a compelling
> reason to adjust it or still "meh"?

Earlier I did some quick microbenchmarks with blake2b_kunit. The
existing unrolling does increase throughput by as much as 50%. It's
probably mostly due to inlining the blake2b_sigma constants.

However, the increased code size is a real issue that doesn't show up in
that microbenchmark. Naturally, it will be especially bad on 32-bit
CPUs, given that BLAKE2b works with 64-bit words. The 32-bit code gets
the code size blow-up from emulating the 64-bit arithmetic using 32-bit
instructions, in addition to the unrolling. Rolling up the rounds loop
when !CONFIG_64BIT seems like a reasonable first step.
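To make the trade-off concrete, here is a rough userspace sketch of what a rolled-up rounds loop looks like. It is not the kernel's lib/crypto code; the names compress_rolled() and blake2b512_abc_h0() are made up for illustration, and the byte counter is assumed to fit in 64 bits. The point is that the sigma row is looked up at run time on each iteration, whereas full unrolling lets the compiler turn each m[s[i]] into a fixed offset.

```c
#include <stdint.h>
#include <string.h>

/* BLAKE2b initialization vector (RFC 7693). */
static const uint64_t blake2b_iv[8] = {
	0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
	0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
	0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL,
	0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL,
};

/* Message word schedule; rounds 10 and 11 reuse rows 0 and 1. */
static const uint8_t sigma[10][16] = {
	{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
	{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 },
	{ 11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4 },
	{ 7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8 },
	{ 9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13 },
	{ 2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9 },
	{ 12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11 },
	{ 13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10 },
	{ 6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5 },
	{ 10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0 },
};

static inline uint64_t ror64(uint64_t x, unsigned int n)
{
	/* n is always one of 32, 24, 16, 63 here, so no shift by 64. */
	return (x >> n) | (x << (64 - n));
}

/* One application of the G mixing function to four state words. */
static void g(uint64_t v[16], int a, int b, int c, int d,
	      uint64_t x, uint64_t y)
{
	v[a] += v[b] + x;
	v[d] = ror64(v[d] ^ v[a], 32);
	v[c] += v[d];
	v[b] = ror64(v[b] ^ v[c], 24);
	v[a] += v[b] + y;
	v[d] = ror64(v[d] ^ v[a], 16);
	v[c] += v[d];
	v[b] = ror64(v[b] ^ v[c], 63);
}

/* Compress one 128-byte block with the 12 rounds kept as a loop. */
static void compress_rolled(uint64_t h[8], const uint64_t m[16],
			    uint64_t t, int last)
{
	uint64_t v[16];
	int r, i;

	memcpy(v, h, 8 * sizeof(v[0]));
	memcpy(v + 8, blake2b_iv, sizeof(blake2b_iv));
	v[12] ^= t;		/* low counter word; high word assumed 0 */
	if (last)
		v[14] = ~v[14];

	for (r = 0; r < 12; r++) {
		const uint8_t *s = sigma[r % 10];	/* run-time lookup */

		g(v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
		g(v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
		g(v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
		g(v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
		g(v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
		g(v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
		g(v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
		g(v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
	}

	for (i = 0; i < 8; i++)
		h[i] ^= v[i] ^ v[i + 8];
}

/*
 * Sanity check: first state word of unkeyed BLAKE2b-512("abc").
 * RFC 7693 Appendix A gives a digest starting ba 80 a5 3f 98 1c 4d 0d,
 * i.e. 0x0d4d1c983fa580ba when read as a little-endian 64-bit word.
 */
uint64_t blake2b512_abc_h0(void)
{
	uint64_t h[8];
	uint64_t m[16] = { 0x636261 };	/* "abc", little-endian, zero-padded */

	memcpy(h, blake2b_iv, sizeof(h));
	h[0] ^= 0x01010040;	/* fanout=1, depth=1, no key, 64-byte digest */
	compress_rolled(h, m, 3, 1);
	return h[0];
}
```

On a 32-bit target each line of g() expands into several instructions for the 64-bit adds and rotates, so keeping a single copy of the round body instead of twelve is where the code size saving would come from.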

We could consider rolling up the rounds loop even when CONFIG_64BIT. If
optimal BLAKE2b throughput was actually important on x86_64, we should
have an AVX optimized implementation anyway. But no one has ever cared
to add one. I think btrfs is the only user currently, but btrfs's use
case is non-cryptographic and it already supports much faster
non-cryptographic checksums (crc32c and xxhash64).

- Eric