netdev - Re: 126 ms irqsoff Latency - Possibly due to commit 190cc82489f4 ("tcp: change source port randomizarion at connect() time")

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CANn89iKTzqBsphc1=6fDbTH=gmA2nVfZTGObCFETfpg03gqb7A@mail.gmail.com>
Date:   Sat, 1 Oct 2022 15:37:09 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     "Jason A. Donenfeld" <Jason@...c4.com>
Cc:     Christophe Leroy <christophe.leroy@...roup.eu>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        David Dworken <ddworken@...gle.com>,
        Willem de Bruijn <willemb@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: 126 ms irqsoff Latency - Possibly due to commit 190cc82489f4
 ("tcp: change source port randomizarion at connect() time")

On Sat, Oct 1, 2022 at 3:34 PM Jason A. Donenfeld <Jason@...c4.com> wrote:
>
> Hi again,
>
> Actually, ignore everything I said before. I looked more closely at the
> trace, and this seems like a bogus report. Let me explain:
>
> The part of the trace that concerns my last email is tiny:
>
>   CORSurv-352       0d..1.   36us : get_random_bytes <-__inet_hash_connect
>   CORSurv-352       0d..1.   45us : _get_random_bytes.part.0 <-__inet_hash_connect
>   CORSurv-352       0d..1.   55us : crng_make_state <-_get_random_bytes.part.0
>   CORSurv-352       0d..1.   65us+: ktime_get_seconds <-crng_make_state
>   CORSurv-352       0d..1.   77us+: crng_fast_key_erasure <-crng_make_state
>   CORSurv-352       0d..1.   89us+: chacha_block_generic <-crng_fast_key_erasure
>   CORSurv-352       0d..1.  101us+: chacha_permute <-chacha_block_generic
>
> After those lines, crng_make_state() returns back into
> _get_random_bytes(), where _get_random_bytes() proceeds to call chacha20
> totally unlocked, having released all interrupts:
>
>   CORSurv-352       0d..1.  129us : chacha_block_generic <-_get_random_bytes.part.0
>   CORSurv-352       0d..1.  139us+: chacha_permute <-chacha_block_generic
>   ...
>   CORSurv-352       0d..1. 126275us : chacha_block_generic <-_get_random_bytes.part.0
>   CORSurv-352       0d..1. 126285us+: chacha_permute <-chacha_block_generic
>
> I guess it's generating a lot of blocks, and this is a slow board?
> Either way, no interrupts are held here, and no locks either.
>
> But then let's zoom out to see if we can figure out what is disabling
> IRQs. This time, pasting from the top and the bottom of the stack trace,
> rather than from the middle:
>
>   CORSurv-352       0d....    4us : _raw_spin_lock_irqsave
>   ...
>   CORSurv-352       0d..1. 126309us : _raw_spin_unlock_irqrestore <-__do_once_done
>   CORSurv-352       0d..1. 126318us+: do_raw_spin_unlock <-_raw_spin_unlock_irqrestore
>   CORSurv-352       0d..1. 126330us+: _raw_spin_unlock_irqrestore
>   CORSurv-352       0d..1. 126346us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
>
> Oh, hello hello __do_once_done(). Let's have a look at you:
>
>   bool __do_once_start(bool *done, unsigned long *flags)
>       __acquires(once_lock)
>   {
>       spin_lock_irqsave(&once_lock, *flags);
>       if (*done) {
>           spin_unlock_irqrestore(&once_lock, *flags);
>           /* Keep sparse happy by restoring an even lock count on
>            * this lock. In case we return here, we don't call into
>            * __do_once_done but return early in the DO_ONCE() macro.
>            */
>           __acquire(once_lock);
>           return false;
>       }
>
>       return true;
>   }
>   EXPORT_SYMBOL(__do_once_start);
>
>   void __do_once_done(bool *done, struct static_key_true *once_key,
>               unsigned long *flags, struct module *mod)
>       __releases(once_lock)
>   {
>       *done = true;
>       spin_unlock_irqrestore(&once_lock, *flags);
>       once_disable_jump(once_key, mod);
>   }
>   EXPORT_SYMBOL(__do_once_done);
>
> Well then! It looks like DO_ONCE() takes an irqsave spinlock. So, as far
> as get_random_bytes() is concerned, interrupts are not being held
> abnormally long. This is something having to do with the code that's
> calling into it. So... doesn't seem like an RNG issue?

I guess you did not read my prior email sent hours ago :)

https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#m43fa0bac40dfda59c2b72cb3e844b41f3cdb949d