linux-kernel - Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE

Message-ID: <CAMj1kXGWMmLUO7Wida0OzK0rK4wDhdmjsx6fGsyED6gsrbsjXA@mail.gmail.com>
Date:   Thu, 23 Apr 2020 10:45:59 +0200
From:   Ard Biesheuvel <ardb@...nel.org>
To:     "Jason A. Donenfeld" <Jason@...c4.com>
Cc:     Eric Biggers <ebiggers@...nel.org>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        Linux Crypto Mailing List <linux-crypto@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to
 PAGE_SIZE chunks

On Wed, 22 Apr 2020 at 22:17, Jason A. Donenfeld <Jason@...c4.com> wrote:
>
> On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <Jason@...c4.com> wrote:
> >
> > On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@...nel.org> wrote:
> > >
> > > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@...c4.com> wrote:
> > > >
> > > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@...nel.org> wrote:
> > > > > Seems this should just be a 'while' loop?
> > > > >
> > > > >         while (bytes) {
> > > > >                 unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > > >
> > > > >                 kernel_neon_begin();
> > > > >                 chacha_doneon(state, dst, src, todo, nrounds);
> > > > >                 kernel_neon_end();
> > > > >
> > > > >                 bytes -= todo;
> > > > >                 src += todo;
> > > > >                 dst += todo;
> > > > >         }
> > > >
> > > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > > doesn't touch), because then we can break out of the loop before
> > > > having to increment src and dst unnecessarily. Likely a pointless
> > > > optimization as probably the compiler can figure out how to avoid
> > > > that. But maybe it can't. If you have a strong preference, I can
> > > > > refactor everything to use `while (bytes)`, but if you don't care,
> > > > let's keep this as-is. Opinion?
> > > >
> > >
> > > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > > given that bytes is guaranteed to be non-zero before we enter the
> > > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > > we can.
> >
> > Okay, will do-while it up for v2.
>
> I just sent v2 containing do-while, and I'm fine with that going in
> that way. But just in the interest of curiosity in the pan-tone
> palette, check this out:
>
> https://godbolt.org/z/VxXien
>
> It looks like on mine, the compiler avoids unnecessarily calling those
> adds on the last iteration, but on the other hand, it results in an
> otherwise unnecessary unconditional jump for the < 4096 case. Sort of
> interesting. Arm64 code is more or less the same difference too.

Yeah, even if shaving off 1 or 2 cycles mattered here (since we've
just decided that ugh() may take up to 20,000 cycles), hiding a couple
of ALU instructions in the slots between the subs (which sets the zero
flag) and the conditional branch that tests it probably comes for free
on in-order cores anyway. And even if it didn't, backwards branches
are usually statically predicted as taken, in which case their results
are actually needed.

On out-of-order cores under speculation, none of this matters anyway.
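
For reference, the three loop shapes discussed in this thread can be put
side by side in a userspace sketch. Nothing below is the actual patch:
CHUNK_SIZE stands in for PAGE_SIZE (assumed to be 4096 here),
process_chunk() is a hypothetical placeholder for chacha_doneon(), and the
kernel_neon_begin()/kernel_neon_end() calls are only marked by comments. The
for(;;) variant is a reconstruction based on the description above (break
before the pointer increments), not a copy of the v1 code. Each function
returns the number of chunks it processed so the variants can be compared.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 4096u /* assumption: stand-in for PAGE_SIZE */

/* Hypothetical placeholder for chacha_doneon(): XOR every byte with 0x5a. */
void process_chunk(unsigned char *dst, const unsigned char *src,
                   unsigned int len)
{
    for (unsigned int i = 0; i < len; i++)
        dst[i] = src[i] ^ 0x5a;
}

static inline unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* 'while' form: tests bytes before the first pass, so bytes == 0 is safe. */
unsigned int run_while(unsigned char *dst, const unsigned char *src,
                       unsigned int bytes)
{
    unsigned int chunks = 0;

    while (bytes) {
        unsigned int todo = min_uint(CHUNK_SIZE, bytes);

        /* kernel_neon_begin() would go here */
        process_chunk(dst, src, todo);
        /* kernel_neon_end() would go here */
        chunks++;

        bytes -= todo;
        src += todo;
        dst += todo;
    }
    return chunks;
}

/* 'for (;;)' form: leaves the loop before the final pointer increments. */
unsigned int run_for(unsigned char *dst, const unsigned char *src,
                     unsigned int bytes)
{
    unsigned int chunks = 0;

    for (;;) {
        unsigned int todo = min_uint(CHUNK_SIZE, bytes);

        process_chunk(dst, src, todo);
        chunks++;

        bytes -= todo;
        if (!bytes)
            break;
        src += todo;
        dst += todo;
    }
    return chunks;
}

/* 'do { } while (bytes)' form: relies on the caller passing bytes != 0. */
unsigned int run_do_while(unsigned char *dst, const unsigned char *src,
                          unsigned int bytes)
{
    unsigned int chunks = 0;

    assert(bytes != 0); /* precondition stated in the thread */
    do {
        unsigned int todo = min_uint(CHUNK_SIZE, bytes);

        process_chunk(dst, src, todo);
        chunks++;

        bytes -= todo;
        src += todo;
        dst += todo;
    } while (bytes);
    return chunks;
}
```

All three variants write the same output for any non-zero length and differ
only in where the exit test sits relative to the pointer updates. With a
10000-byte input each one processes chunks of 4096, 4096 and 1808 bytes.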