netdev - Re: 126 ms irqsoff Latency - Possibly due to commit 190cc82489f4 ("tcp: change source port randomizarion at connect() time")

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Yzi8Md2tkSYDnF1B@zx2c4.com>
Date:   Sun, 2 Oct 2022 00:16:17 +0200
From:   "Jason A. Donenfeld" <Jason@...c4.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     Christophe Leroy <christophe.leroy@...roup.eu>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        David Dworken <ddworken@...gle.com>,
        Willem de Bruijn <willemb@...gle.com>,
        "David S. Miller" <davem@...emloft.net>, bigeasy@...utronix.de
Subject: Re: 126 ms irqsoff Latency - Possibly due to commit 190cc82489f4
 ("tcp: change source port randomizarion at connect() time")

(CC+Sebastian)

Hi Eric, Christophe,

I'm trying to understand the context of this and whether/why there's a
problem. Some overview on how get_random_bytes() works:

Most of the time, get_random_bytes() is completely lockless and operates
over per-CPU data structures. get_random_bytes() calls
_get_random_bytes(), which calls crng_make_state(), and then operates
over stack data to churn out some random bytes. crng_make_state() is
where all the meat happens.

In crng_make_state(), there are three unlikely conditionals where locks
are taken. The first is:

    if (!crng_ready()) {
        ... do some expensive things involving locks ...
	... but only during early boot before the rng is initialized ...
    }

The second one is:

    if (unlikely(time_is_before_jiffies(READ_ONCE(base_crng.birth) + crng_reseed_interval()))) {
        ... do something less expensive involving locks ...
	... which happens approximately once per minute ...
    }

The third one is:

    if (unlikely(crng->generation != READ_ONCE(base_crng.generation))) {
        ... do something even less expensive involving locks ...
	... which happens when after a different cpu hit the above ...
    }

So all three of these conditions are pretty darn unlikely, with the
exception of the first one that happens all the time during early boot
before the RNG is initialized, after which it is static-branched out and
never triggers again. So as far as /locks/ are concerned, things should
be good here.

However, in order to operate on per-cpu data, and therefore be lockless
most of the time, it does take a "local lock", which is basically just
disabling interrupts on non-RT to do a short operation:

    local_lock_irqsave(&crngs.lock, flags);
    crng = raw_cpu_ptr(&crngs);
    crng_fast_key_erasure(...);
    local_unlock_irqrestore(&crngs.lock, flags);

crng_fast_key_erasure(), in turn, computes a single block of chacha20,
which should be relatively fast. So the critical section is very short
there.

The reason that's local_lock_irqsave() rather than local_lock() (which
would only disable preemption, I believe), is because IRQ handlers are
supposed to be able to have access to random bytes too. It seems like it
wouldn't be a super nice thing to remove that capability.

It might be possible to double the amount of per-cpu data and have a
separate state for IRQ than for non-IRQ, but that seems kind of wasteful
and complex/hairy to implement.

So that leads me to wonder more about the context: why does this matter?
It looks like you're hitting this from a DO_ONCE() thing, which are
usually only hit, as the name says, once, and then incur the overhead of
firing off a worker to change the once-static-branch, which means
DO_ONCE()es aren't very fast anyway? Or does that not accurately reflect
what's happening?

I'll also CC Sebastian here, who worked with me on that local lock and
might have some insights on IRQ latency as well.

Regards,
Jason