Message-ID: <CAMj1kXHZ9vaMHNvBb1ByjazwqJANv0pATGE7q0X=RmgsxS5i+w@mail.gmail.com>
Date: Wed, 12 Nov 2025 11:32:23 +0100
From: Ard Biesheuvel <ardb@...nel.org>
To: Eric Biggers <ebiggers@...nel.org>
Cc: linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org, 
	"Jason A . Donenfeld" <Jason@...c4.com>, Herbert Xu <herbert@...dor.apana.org.au>, 
	linux-arm-kernel@...ts.infradead.org, x86@...nel.org
Subject: Re: [PATCH 2/9] lib/crypto: polyval: Add POLYVAL library

On Tue, 11 Nov 2025 at 20:47, Eric Biggers <ebiggers@...nel.org> wrote:
>
> On Tue, Nov 11, 2025 at 08:42:29AM +0100, Ard Biesheuvel wrote:
> > On Mon, 10 Nov 2025 at 16:21, Ard Biesheuvel <ardb@...nel.org> wrote:
> > >
> > > Hi,
> > >
> > > On Mon, 10 Nov 2025 at 00:49, Eric Biggers <ebiggers@...nel.org> wrote:
> > > >
> > > > Add support for POLYVAL to lib/crypto/.
> > > >
> > > > This will replace the polyval crypto_shash algorithm and its use in the
> > > > hctr2 template, simplifying the code and reducing overhead.
> > > >
> > > > Specifically, this commit introduces the POLYVAL library API and a
> > > > generic implementation of it.  Later commits will migrate the existing
> > > > architecture-optimized implementations of POLYVAL into lib/crypto/ and
> > > > add a KUnit test suite.
> > > >
> > > > I've also rewritten the generic implementation completely, using a more
> > > > modern approach instead of the traditional table-based approach.  It's
> > > > now constant-time, requires no precomputation or dynamic memory
> > > > allocations, decreases the per-key memory usage from 4096 bytes to 16
> > > > bytes, and is faster than the old polyval-generic even on bulk data
> > > > reusing the same key (at least on x86_64, where I measured 15% faster).
> > > > We should do this for GHASH too, but for now just do it for POLYVAL.
> > > >
> > >
> > > Very nice.
> > >
> > > GHASH might suffer on 32-bit, I suppose, but taking this approach for
> > > GHASH too, at least on 64-bit, would be a huge improvement.
> > >
> > > I had a stab at replacing the int128 arithmetic with
> > > __builtin_bitreverse64(), but it seems to make little difference (and
> > > GCC does not support it [yet]). I've tried both arm64 and x86, and the
> > > perf delta (using your kunit benchmark) is negligible in either case.
> >
> > Sigh. I intended to apply only the generic patch and the kunit test,
> > but ended up applying the whole series, which explains perfectly why
> > the x86_64 and arm64 numbers were identical with and without the
> > change: the generic code wasn't even being used.
> >
> > So trying this again, on a Cortex-A72 without Crypto Extensions, I do
> > get a ~30% performance improvement doing the below. I haven't
> > re-tested x86, but given that it does not appear to have a native
> > scalar bit reverse instruction (or __builtin_bitreverse64() is broken
> > for it), there is probably no point in finding out.
> >
> > Not saying we should do this for POLYVAL, but something to keep in
> > mind for gf128mul.c perhaps.
> >
> >
> > --- a/lib/crypto/polyval.c
> > +++ b/lib/crypto/polyval.c
> > @@ -42,11 +42,48 @@
> >   * 256-bit => 128-bit reduction algorithm.
> >   */
> >
> > -#ifdef CONFIG_ARCH_SUPPORTS_INT128
> > +#if defined(CONFIG_ARCH_SUPPORTS_INT128) || __has_builtin(__builtin_bitreverse64)
> >
> >  /* Do a 64 x 64 => 128 bit carryless multiplication. */
> >  static void clmul64(u64 a, u64 b, u64 *out_lo, u64 *out_hi)
> >  {
> > +       u64 a0 = a & 0x1111111111111111;
> > +       u64 a1 = a & 0x2222222222222222;
> > +       u64 a2 = a & 0x4444444444444444;
> > +       u64 a3 = a & 0x8888888888888888;
> > +
> > +       u64 b0 = b & 0x1111111111111111;
> > +       u64 b1 = b & 0x2222222222222222;
> > +       u64 b2 = b & 0x4444444444444444;
> > +       u64 b3 = b & 0x8888888888888888;
> > +
> > +#if __has_builtin(__builtin_bitreverse64)
> > +#define brev64 __builtin_bitreverse64
> > +       u64 c0 = (a0 * b0) ^ (a1 * b3) ^ (a2 * b2) ^ (a3 * b1);
> > +       u64 c1 = (a0 * b1) ^ (a1 * b0) ^ (a2 * b3) ^ (a3 * b2);
> > +       u64 c2 = (a0 * b2) ^ (a1 * b1) ^ (a2 * b0) ^ (a3 * b3);
> > +       u64 c3 = (a0 * b3) ^ (a1 * b2) ^ (a2 * b1) ^ (a3 * b0);
> > +
> > +       a0 = brev64(a0);
> > +       a1 = brev64(a1);
> > +       a2 = brev64(a2);
> > +       a3 = brev64(a3);
> > +
> > +       b0 = brev64(b0);
> > +       b1 = brev64(b1);
> > +       b2 = brev64(b2);
> > +       b3 = brev64(b3);
> > +
> > +       u64 d0 = (a0 * b0) ^ (a1 * b3) ^ (a2 * b2) ^ (a3 * b1);
> > +       u64 d1 = (a0 * b1) ^ (a1 * b0) ^ (a2 * b3) ^ (a3 * b2);
> > +       u64 d2 = (a0 * b2) ^ (a1 * b1) ^ (a2 * b0) ^ (a3 * b3);
> > +       u64 d3 = (a0 * b3) ^ (a1 * b2) ^ (a2 * b1) ^ (a3 * b0);
> > +
> > +       *out_hi = ((brev64(d0) >> 1) & 0x1111111111111111) ^
> > +                 ((brev64(d1) >> 1) & 0x2222222222222222) ^
> > +                 ((brev64(d2) >> 1) & 0x4444444444444444) ^
> > +                 ((brev64(d3) >> 1) & 0x8888888888888888);
>
> Yeah, that's an interesting idea!  So if we bit-reflect the inputs, do
> an n x n => n multiplication, and bit-reflect the output and right-shift
> it by 1, we get the high half of the desired n x n => 2n multiplication.
> (This relies on the fact that carries are being discarded.)  Then we
> don't need an instruction that does an n x n => 2n multiplication or
> produces the high half of it.
>
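
That reflection identity is easy to sanity check in isolation for a true
carryless multiply. A throwaway user-space sketch (nothing from the
kernel tree; clmul16() and brev16() are ad-hoc helpers written just for
this) checks it with 16-bit operands against a naive shift-and-XOR
reference:

#include <stdint.h>
#include <stdio.h>

/* Reference 16 x 16 => 32 bit carryless multiplication. */
static uint32_t clmul16(uint16_t a, uint16_t b)
{
	uint32_t r = 0;

	for (int i = 0; i < 16; i++)
		if ((b >> i) & 1)
			r ^= (uint32_t)a << i;
	return r;
}

/* Bit-reverse a 16-bit value. */
static uint16_t brev16(uint16_t x)
{
	uint16_t r = 0;

	for (int i = 0; i < 16; i++)
		r |= ((x >> i) & 1) << (15 - i);
	return r;
}

int main(void)
{
	for (uint32_t a = 0; a < 0x10000; a += 251) {
		for (uint32_t b = 0; b < 0x10000; b += 257) {
			/* High half of the 16 x 16 => 32 carryless product. */
			uint16_t hi = clmul16(a, b) >> 16;
			/* Reflect the inputs, keep the low 16 bits of their
			 * carryless product, reflect that and shift right by 1. */
			uint16_t hi2 = brev16((uint16_t)clmul16(brev16(a), brev16(b))) >> 1;

			if (hi != hi2) {
				printf("mismatch at a=%#x b=%#x\n", a, b);
				return 1;
			}
		}
	}
	printf("identity holds\n");
	return 0;
}

The reflected (d0..d3) half of the diff above is the same identity
applied to the 64-bit strands.
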
> The availability of hardware bit-reversal is limited, though.  arm32,
> arm64, and mips32r6 have it.  But all of those also have a "multiply
> high" instruction.  So the 30% performance improvement you saw on arm64
> seems surprising to me, as umulh should have been used.  (I verified
> that it's indeed used in the generated asm with both gcc and clang.)
>

Yeah - it might just be the compiler making a mess of things. GCC is
already considerably faster than Clang at the u128 arithmetic (75 vs
67 MB/s on a RPi4). But the bit reverse code manages 85 MB/s [which is
only ~26% faster than Clang's u128 number, btw, so a bit less than
when I tried this the other day].

I re-tested on an Apple M2 (which doesn't need this code, but for
comparison), and there the GCC-generated u128 code is as fast as or
slightly faster than the Clang-generated bitreverse code.

So I guess this is more a matter of fixing the u128-related codegen in Clang.


> The available bit-reversal abstractions aren't too great either, with
> __builtin_bitreverse64() being clang-specific and <linux/bitrev.h>
> having a table-based, i.e. non-constant-time, fallback.  So presumably
> we'd need to add our own which is guaranteed to use the actual
> instructions and not some slow and/or table-based fallback.
>

__builtin_bitreverse64() does exist on x86 too, but generates a huge
pile of code, so the mere availability is not a good reason to use it
either.
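
For illustration only - not something I'm proposing for this series, and
ct_bitrev64() plus its placement are made up here - a helper that is
guaranteed to be constant-time could pair the builtin (where the compiler
has it) with a branchless swap-sequence fallback instead of the
table-based one in <linux/bitrev.h>:

#include <linux/swab.h>
#include <linux/types.h>

static inline u64 ct_bitrev64(u64 x)
{
#if __has_builtin(__builtin_bitreverse64)
	/* Clang turns this into a single rbit on arm64. */
	return __builtin_bitreverse64(x);
#else
	/* Reverse the bits within each byte: swap adjacent bits, then
	 * 2-bit pairs, then nibbles... */
	x = ((x & 0x5555555555555555ULL) << 1) |
	    ((x >> 1) & 0x5555555555555555ULL);
	x = ((x & 0x3333333333333333ULL) << 2) |
	    ((x >> 2) & 0x3333333333333333ULL);
	x = ((x & 0x0f0f0f0f0f0f0f0fULL) << 4) |
	    ((x >> 4) & 0x0f0f0f0f0f0f0f0fULL);
	/* ...then reverse the byte order. */
	return swab64(x);
#endif
}

Although, as above, on x86 that fallback is more or less the same pile of
code the builtin expands to, so it would only pay off where bit reversal
maps to an actual instruction.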

> I'll definitely look into this more later when bringing this improvement
> to GHASH too.  But for now I think we should go with the version I have
> in my patch.
>

For now, definitely. And I'll see if I can file a Clang bug somewhere.
But I don't think we'll be making use of bitreverse for GHASH either.
