Message-ID: <CANn89iK3maLVo_G7MGswuXV0Og9tEFJxMZt+34ZKTo4zUNoLRw@mail.gmail.com>
Date: Sat, 1 Oct 2022 15:31:15 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: "Jason A. Donenfeld" <Jason@...c4.com>
Cc: Christophe Leroy <christophe.leroy@...roup.eu>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
David Dworken <ddworken@...gle.com>,
Willem de Bruijn <willemb@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: 126 ms irqsoff Latency - Possibly due to commit 190cc82489f4
("tcp: change source port randomizarion at connect() time")
On Sat, Oct 1, 2022 at 3:16 PM Jason A. Donenfeld <Jason@...c4.com> wrote:
>
> (CC+Sebastian)
>
> Hi Eric, Christophe,
>
> I'm trying to understand the context of this and whether/why there's a
> problem. First, some overview of how get_random_bytes() works:
>
> Most of the time, get_random_bytes() is completely lockless and operates
> over per-CPU data structures. get_random_bytes() calls
> _get_random_bytes(), which calls crng_make_state(), and then operates
> over stack data to churn out some random bytes. crng_make_state() is
> where all the meat happens.
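>
> To make that concrete, the fast path looks roughly like this. This is
> abridged from my reading of drivers/char/random.c, so treat it as a
> sketch rather than the verbatim code:
>
> static void _get_random_bytes(void *buf, size_t len)
> {
>         u32 chacha_state[CHACHA_STATE_WORDS];
>         u8 tmp[CHACHA_BLOCK_SIZE];
>         size_t first_block_len;
>
>         /* Expand the per-CPU key into a fresh chacha state; this is
>          * the only step that touches shared or per-CPU data. */
>         first_block_len = min_t(size_t, 32, len);
>         crng_make_state(chacha_state, buf, first_block_len);
>         len -= first_block_len;
>         buf += first_block_len;
>
>         /* Everything below operates purely on stack data. */
>         while (len) {
>                 if (len < CHACHA_BLOCK_SIZE) {
>                         chacha20_block(chacha_state, tmp);
>                         memcpy(buf, tmp, len);
>                         memzero_explicit(tmp, sizeof(tmp));
>                         break;
>                 }
>                 chacha20_block(chacha_state, buf);
>                 len -= CHACHA_BLOCK_SIZE;
>                 buf += CHACHA_BLOCK_SIZE;
>         }
>         memzero_explicit(chacha_state, sizeof(chacha_state));
> }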
>
> In crng_make_state(), there are three unlikely conditionals where locks
> are taken. The first is:
>
> if (!crng_ready()) {
>         ... do some expensive things involving locks ...
>         ... but only during early boot before the rng is initialized ...
> }
>
> The second one is:
>
> if (unlikely(time_is_before_jiffies(READ_ONCE(base_crng.birth) + crng_reseed_interval()))) {
>         ... do something less expensive involving locks ...
>         ... which happens approximately once per minute ...
> }
>
> The third one is:
>
> if (unlikely(crng->generation != READ_ONCE(base_crng.generation))) {
>         ... do something even less expensive involving locks ...
>         ... which happens after a different CPU hits the above ...
> }
>
> So all three of these conditions are pretty darn unlikely. The
> exception is the first one, which happens all the time during early
> boot, before the RNG is initialized; after that it is static-branched
> out and never triggers again. So as far as /locks/ are concerned,
> things should be good here.
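>
> For the curious, crng_ready() is roughly the following (paraphrased
> from drivers/char/random.c), so once the static key is flipped the
> whole check becomes a patched-out branch:
>
> static DEFINE_STATIC_KEY_FALSE(crng_is_ready);
> #define crng_ready() \
>         (static_branch_likely(&crng_is_ready) || crng_init >= CRNG_READY)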
>
> However, in order to operate on per-cpu data, and therefore be lockless
> most of the time, it does take a "local lock", which is basically just
> disabling interrupts on non-RT to do a short operation:
>
> local_lock_irqsave(&crngs.lock, flags);
> crng = raw_cpu_ptr(&crngs);
> crng_fast_key_erasure(...);
> local_unlock_irqrestore(&crngs.lock, flags);
>
> crng_fast_key_erasure(), in turn, computes a single block of chacha20,
> which should be relatively fast. So the critical section is very short
> there.
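>
> For reference, the fast key erasure idea looks roughly like this
> (abridged from drivers/char/random.c, from memory, so not verbatim):
>
> static void crng_fast_key_erasure(u8 key[CHACHA_KEY_SIZE],
>                                   u32 chacha_state[CHACHA_STATE_WORDS],
>                                   u8 *random_data, size_t random_data_len)
> {
>         u8 first_block[CHACHA_BLOCK_SIZE];
>
>         /* Compute one chacha20 block keyed with the current key... */
>         chacha_init_consts(chacha_state);
>         memcpy(&chacha_state[4], key, CHACHA_KEY_SIZE);
>         memset(&chacha_state[12], 0, sizeof(u32) * 4);
>         chacha20_block(chacha_state, first_block);
>
>         /* ...whose first 32 bytes immediately overwrite the key (the
>          * "erasure"), while the remainder seeds the caller's state.
>          * So the critical section really is about one block's work. */
>         memcpy(key, first_block, CHACHA_KEY_SIZE);
>         memcpy(random_data, first_block + CHACHA_KEY_SIZE, random_data_len);
>         memzero_explicit(first_block, sizeof(first_block));
> }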
>
> The reason that's local_lock_irqsave() rather than local_lock() (which
> would only disable preemption, I believe) is that IRQ handlers are
> supposed to have access to random bytes too. Removing that capability
> wouldn't seem like a super nice thing to do.
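>
> Concretely, if that were a plain local_lock(), an interrupt arriving
> in the middle of the critical section could reenter the very same
> per-CPU state (hypothetical interleaving, not a real trace):
>
> task context                          hard IRQ on the same CPU
> ------------                          ------------------------
> local_lock(&crngs.lock);   /* preemption off, IRQs still on */
> crng = raw_cpu_ptr(&crngs);
> crng_fast_key_erasure(crng->key, ...);
>     <-- IRQ fires -->                 get_random_bytes()
>                                         raw_cpu_ptr(&crngs) /* same state! */
>                                         crng_fast_key_erasure(crng->key, ...)
> /* resumes with a half-updated key */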
>
> It might be possible to double the amount of per-cpu data and have a
> separate state for IRQ than for non-IRQ, but that seems kind of wasteful
> and complex/hairy to implement.
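>
> Just to illustrate what I mean, something like the following; this is
> purely hypothetical, none of these names exist in the tree:
>
> struct crng_ctx_pair {
>         struct crng task;             /* process/softirq context */
>         struct crng hardirq;          /* hard IRQ context */
> };
> static DEFINE_PER_CPU(struct crng_ctx_pair, crng_ctx_pairs);
>
> static struct crng *this_context_crng(void)
> {
>         struct crng_ctx_pair *p = raw_cpu_ptr(&crng_ctx_pairs);
>
>         /* Hard IRQs would get their own state, so the task-context
>          * path could get away with local_lock(), i.e. only disabling
>          * preemption rather than interrupts. */
>         return in_hardirq() ? &p->hardirq : &p->task;
> }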
>
> So that leads me to wonder more about the context: why does this matter?
> It looks like you're hitting this from a DO_ONCE() thing, which is
> usually only hit, as the name says, once, and then incurs the overhead
> of firing off a worker to flip the once-static-branch, so DO_ONCE()es
> aren't very fast anyway? Or does that not accurately reflect what's
> happening?
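>
> For reference, DO_ONCE() has roughly this shape (paraphrased from
> include/linux/once.h, so not verbatim):
>
> #define DO_ONCE(func, ...)                                             \
>         ({                                                             \
>                 bool ___ret = false;                                   \
>                 static bool ___done = false;                           \
>                 static DEFINE_STATIC_KEY_TRUE(___once_key);            \
>                 if (static_branch_unlikely(&___once_key)) {            \
>                         unsigned long ___flags;                        \
>                         ___ret = __do_once_start(&___done, &___flags); \
>                         if (___ret) {                                  \
>                                 func(__VA_ARGS__);                     \
>                                 /* fires a worker to flip the key */   \
>                                 __do_once_done(&___done, &___once_key, \
>                                                &___flags, THIS_MODULE);\
>                         }                                              \
>                 }                                                      \
>                 ___ret;                                                \
>         })
>
> So only the very first call takes the slow path; after the worker
> runs, the static branch makes the whole thing a no-op.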
>
> I'll also CC Sebastian here, who worked with me on that local lock and
> might have some insights on IRQ latency as well.
Sorry Jason, it seems I forgot to CC you on the tentative patch I sent
earlier today:
https://patchwork.kernel.org/project/netdevbpf/patch/20221001205102.2319658-1-eric.dumazet@gmail.com/