Date:   Fri, 7 Aug 2020 12:21:36 -0700
From:   Linus Torvalds <>
To:     Andy Lutomirski <>
Cc:     Willy Tarreau <>, Marc Plumb <>,
        "Theodore Ts'o" <>, Netdev <>,
        Amit Klein <>,
        Eric Dumazet <>,
        "Jason A. Donenfeld" <>,
        Andrew Lutomirski <>,
        Kees Cook <>,
        Thomas Gleixner <>,
        Peter Zijlstra <>,
        stable <>
Subject: Re: Flaw in "random32: update the net random state on interrupt and activity"

On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <> wrote:
> > On Aug 7, 2020, at 11:10 AM, Linus Torvalds <> wrote:
> >
> >
> > I tried something very much like that in user space to just see how
> > many cycles it ended up being.
> >
> > I made a "just raw ChaCha20", and it was already much too slow for
> > what some of the networking people claim to want.
> Do you remember the numbers?

Sorry, no. I wrote a hacky thing in user space, and threw it away.

> Certainly a full ChaCha20 per random number is too much, but AFAICT the network folks want 16 or 32 bits at a time, which is 1/16 or 1/8 of a ChaCha20.

That's what I did (well, I did just the 32-bit one), basically
emulating percpu accesses for incrementing the offset (I didn't
actually *do* percpu accesses, I just did a single-threaded run and
used globals, but wrote it with wrappers so that it would look like it
might work).
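For concreteness, here is a user-space sketch of the shape being described (my reconstruction, not the actual throwaway program): one ChaCha20 output block is buffered, and 32 bits are handed out per call, so the block function only runs once every 16 draws. The all-zero key/nonce and the `get_random_u32` name are placeholders; a real per-cpu version would replace the globals with per-cpu accesses.

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))
#define QR(a, b, c, d)                                              \
	(a += b, d ^= a, d = ROTL32(d, 16),                         \
	 c += d, b ^= c, b = ROTL32(b, 12),                         \
	 a += b, d ^= a, d = ROTL32(d, 8),                          \
	 c += d, b ^= c, b = ROTL32(b, 7))

/* Plain ChaCha20 block function: 20 rounds = 10 double rounds. */
static void chacha20_block(uint32_t out[16], const uint32_t in[16])
{
	uint32_t x[16];
	int i;

	memcpy(x, in, sizeof(x));
	for (i = 0; i < 10; i++) {
		QR(x[0], x[4], x[8],  x[12]);
		QR(x[1], x[5], x[9],  x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		QR(x[0], x[5], x[10], x[15]);
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8],  x[13]);
		QR(x[3], x[4], x[9],  x[14]);
	}
	for (i = 0; i < 16; i++)
		out[i] = x[i] + in[i];
}

/* Single-threaded stand-in for per-cpu state: one 64-byte block,
 * refilled once every 16 draws of 32 bits. */
static uint32_t state[16] = {
	0x61707865, 0x3320646e, 0x79622d32, 0x6b206574, /* "expand 32-byte k" */
	/* key, counter, and nonce would follow; zeros for this sketch */
};
static uint32_t buf[16];
static unsigned int offset = 16;	/* force a refill on first use */

static uint32_t get_random_u32(void)
{
	if (offset >= 16) {
		chacha20_block(buf, state);
		state[12]++;		/* bump the block counter */
		offset = 0;
	}
	return buf[offset++];
}
```

The point of the wrapper is that 15 out of 16 calls are just a load and an increment; the full block function, and its I$ footprint, only shows up on the refill path.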

> DJB claims 4 cycles per byte on Core 2

I took the reference C implementation as-is, and just compiled it with
-O2, so my numbers may not reflect what a heavily optimized
implementation does.

But it was way more than that, even when amortizing for "only need to
do it every 8 cases". I think the 4 cycles/byte might be some "zero
branch mispredicts" case when you've fully unrolled the thing, but
then you'll be taking I$ misses out of the wazoo, since by definition
this won't be in your L1 I$ at all (only called every 8 times).

Sure, it might look ok on microbenchmarks where it does stay hot in
the cache all the time, but that's not realistic.

