linux-kernel - Re: [PATCH] lib/crypto: blake2b: Limit frame size workaround to GCC

Message-ID: <20251124224028.GA1827@quark>
Date: Mon, 24 Nov 2025 14:40:28 -0800
From: Eric Biggers <ebiggers@...nel.org>
To: "Jason A. Donenfeld" <Jason@...c4.com>
Cc: david laight <david.laight@...box.com>,
	Thorsten Blum <thorsten.blum@...ux.dev>,
	Ard Biesheuvel <ardb@...nel.org>, linux-crypto@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] lib/crypto: blake2b: Limit frame size workaround to GCC
 < 12.2 on i386

On Mon, Nov 24, 2025 at 06:14:31PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 24, 2025 at 10:08 AM david laight <david.laight@...box.com> wrote:
> > > How about we roll up the BLAKE2b rounds loop if !CONFIG_64BIT?
> >
> > I do wonder about the real benefit of some of the massive loop unrolling
> > that happens in a lot of these algorithms (not just blake2b).
> 
> I remember looking at this in the context of blake2s, with two paths,
> depending on CONFIG_CC_OPTIMIZE_FOR_SIZE, but the savings didn't seem
> enough for the performance hit. It might be platform specific though.
> I guess try it and post numbers, and that'll either be a compelling
> reason to adjust it or still "meh"?

Earlier I did some quick microbenchmarks with blake2b_kunit.  The
existing unrolling does increase throughput by as much as 50%.  It's
probably mostly due to inlining the blake2b_sigma constants.
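
For illustration only, here is a minimal sketch of what a rolled-up
rounds loop looks like.  It is not the kernel's lib/crypto code: the
function names blake2b_compress_rolled() and blake2b_abc_h0() are
hypothetical, and the message counter is assumed to fit in 64 bits.
The constants and round structure follow RFC 7693.  The point is that
in the rolled form each message-word index comes from a runtime load of
blake2b_sigma[r % 10][...], whereas unrolling all 12 rounds lets the
compiler treat those indices as constants.

```c
#include <assert.h>
#include <stdint.h>

/* BLAKE2b initialization vector (RFC 7693). */
static const uint64_t blake2b_iv[8] = {
	0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
	0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
	0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL,
	0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL,
};

/* Message word schedule (RFC 7693); rounds 10 and 11 reuse rows 0 and 1. */
static const uint8_t blake2b_sigma[10][16] = {
	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
	{ 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
	{  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
	{  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
	{  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
	{ 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
	{ 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
	{  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
	{ 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
};

static inline uint64_t ror64(uint64_t x, unsigned int n)
{
	return (x >> n) | (x << (64 - n));
}

/* The BLAKE2b mixing function G. */
#define G(a, b, c, d, x, y) do {			\
	a += b + (x); d = ror64(d ^ a, 32);		\
	c += d;       b = ror64(b ^ c, 24);		\
	a += b + (y); d = ror64(d ^ a, 16);		\
	c += d;       b = ror64(b ^ c, 63);		\
} while (0)

/* Hypothetical name: compress one block with the 12 rounds rolled up. */
void blake2b_compress_rolled(uint64_t h[8], const uint64_t m[16],
			     uint64_t t, int last)
{
	uint64_t v[16];
	int i, r;

	for (i = 0; i < 8; i++) {
		v[i] = h[i];
		v[i + 8] = blake2b_iv[i];
	}
	v[12] ^= t;		/* low counter word; high word assumed 0 */
	if (last)
		v[14] = ~v[14];

	for (r = 0; r < 12; r++) {
		/* Runtime table lookup: this is what unrolling removes. */
		const uint8_t *s = blake2b_sigma[r % 10];

		G(v[0], v[4], v[8],  v[12], m[s[0]],  m[s[1]]);
		G(v[1], v[5], v[9],  v[13], m[s[2]],  m[s[3]]);
		G(v[2], v[6], v[10], v[14], m[s[4]],  m[s[5]]);
		G(v[3], v[7], v[11], v[15], m[s[6]],  m[s[7]]);
		G(v[0], v[5], v[10], v[15], m[s[8]],  m[s[9]]);
		G(v[1], v[6], v[11], v[12], m[s[10]], m[s[11]]);
		G(v[2], v[7], v[8],  v[13], m[s[12]], m[s[13]]);
		G(v[3], v[4], v[9],  v[14], m[s[14]], m[s[15]]);
	}

	for (i = 0; i < 8; i++)
		h[i] ^= v[i] ^ v[i + 8];
}

/* Hypothetical helper: unkeyed BLAKE2b-512 of "abc", first state word. */
uint64_t blake2b_abc_h0(void)
{
	uint64_t h[8], m[16] = { 0x636261 };	/* "abc", little-endian */
	int i;

	assert(sizeof(m) == 128);		/* one 128-byte block */
	for (i = 0; i < 8; i++)
		h[i] = blake2b_iv[i];
	h[0] ^= 0x01010040;			/* no key, 64-byte digest */
	blake2b_compress_rolled(h, m, 3, 1);
	return h[0];
}
```

An unrolled variant would repeat the loop body 12 times with the row of
blake2b_sigma fixed in each copy, trading code size for the removal of
those loads.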

However, the increased code size is a real issue that doesn't show up in
that microbenchmark.  Naturally, it will be especially bad on 32-bit
CPUs, given that BLAKE2b works with 64-bit words.  The 32-bit code gets
the code size blow-up from emulating the 64-bit arithmetic using 32-bit
instructions, in addition to the unrolling.  Rolling up the rounds loop
when !CONFIG_64BIT seems like a reasonable first step.
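
To make the 32-bit cost concrete, here is a rough sketch of how the
64-bit operations BLAKE2b uses can be expressed with 32-bit halves.
This is an illustration, not compiler output and not kernel code; the
type u64_pair and all function names are hypothetical.  Each 64-bit add
becomes two adds plus a carry, and each rotate by 24 or 16 becomes four
shifts and two ORs, while the rotate by 32 is just a swap of halves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a 64-bit value held as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } u64_pair;

u64_pair from_u64(uint64_t x)
{
	u64_pair r = { (uint32_t)x, (uint32_t)(x >> 32) };
	return r;
}

uint64_t to_u64(u64_pair x)
{
	return ((uint64_t)x.hi << 32) | x.lo;
}

/* 64-bit add: two 32-bit adds plus carry propagation. */
u64_pair add64(u64_pair a, u64_pair b)
{
	u64_pair r;

	r.lo = a.lo + b.lo;
	r.hi = a.hi + b.hi + (r.lo < a.lo);	/* carry out of the low half */
	return r;
}

/* Rotate right by n, for 0 < n < 32 (BLAKE2b uses 24 and 16). */
u64_pair rotr64_small(u64_pair x, unsigned int n)
{
	u64_pair r;

	assert(n > 0 && n < 32);
	r.lo = (x.lo >> n) | (x.hi << (32 - n));
	r.hi = (x.hi >> n) | (x.lo << (32 - n));
	return r;
}

/* Rotate right by 32: swap the halves, no shifts needed. */
u64_pair rotr64_32(u64_pair x)
{
	u64_pair r = { x.hi, x.lo };
	return r;
}

/* Rotate right by 63, which is the same as rotate left by 1. */
u64_pair rotr64_63(u64_pair x)
{
	u64_pair r;

	r.lo = (x.lo << 1) | (x.hi >> 31);
	r.hi = (x.hi << 1) | (x.lo >> 31);
	return r;
}
```

Every G call performs four such adds with a further operand folded into
two of them, plus four rotates, so multiplying this expansion by 8 G
calls and 12 unrolled rounds gives a sense of where the 32-bit code
size goes.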

We could consider rolling up the rounds loop even when CONFIG_64BIT.  If
optimal BLAKE2b throughput were actually important on x86_64, we should
have an AVX-optimized implementation anyway.  But no one has ever cared
to add one.  I think btrfs is the only user currently, but btrfs's use
case is non-cryptographic and it already supports much faster
non-cryptographic checksums (crc32c and xxhash64).

- Eric