[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whf+_rWROqPUMr=Do0n1ADhkEeEFL0tY+M60TJZtdrq2A@mail.gmail.com>
Date: Fri, 7 Aug 2020 12:21:36 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Andy Lutomirski <luto@...capital.net>
Cc: Willy Tarreau <w@....eu>, Marc Plumb <lkml.mplumb@...il.com>,
"Theodore Ts'o" <tytso@....edu>, Netdev <netdev@...r.kernel.org>,
Amit Klein <aksecurity@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
"Jason A. Donenfeld" <Jason@...c4.com>,
Andrew Lutomirski <luto@...nel.org>,
Kees Cook <keescook@...omium.org>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
stable <stable@...r.kernel.org>
Subject: Re: Flaw in "random32: update the net random state on interrupt and activity"
On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@...capital.net> wrote:
>
> > On Aug 7, 2020, at 11:10 AM, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> >
> >
> > I tried something very much like that in user space to just see how
> > many cycles it ended up being.
> >
> > I made a "just raw ChaCha20", and it was already much too slow for
> > what some of the networking people claim to want.
>
> Do you remember the numbers?
Sorry, no. I wrote a hacky thing in user space, and threw it away.
> Certainly a full ChaCha20 per random number is too much, but AFAICT the network folks want 16 or 32 bits at a time, which is 1/16 or 1/8 of a ChaCha20.
That's what I did (well, I did just the 32-bit one), basically
emulating percpu accesses for incrementing the offset (I didn't
actually *do* percpu accesses, I just did a single-threaded run and
used globals, but wrote it with wrappers so that it would look like it
might work).
> DJB claims 4 cycles per byte on Core 2
I took the reference C implementation as-is, and just compiled it with
O2, so my numbers may not be what some heavily optimized case does.
But it was way more than that, even when amortizing for "only need to
do it every 8 cases". I think the 4 cycles/byte might be some "zero
branch mispredicts" case when you've fully unrolled the thing, but
then you'll be taking I$ misses out of the wazoo, since by definition
this won't be in your L1 I$ at all (only called every 8 times).
Sure, it might look ok on microbenchmarks where it does stay hot the
cache all the time, but that's not realistic. I
Linus
Powered by blists - more mailing lists