Message-ID: <CANn89iKS6fas9O74U5w1wb+8DN==fXRKQ8nzq0tkT_VOXRtYBQ@mail.gmail.com>
Date:   Fri, 1 Nov 2019 15:30:56 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     netdev <netdev@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned

On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>
> While investigating the spinlock contention on resetting TCP
> retransmit timer:
>
>   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
>    ...
>     - 58.83% tcp_v4_rcv
>       - 58.80% tcp_v4_do_rcv
>          - 58.80% tcp_rcv_established
>             - 52.88% __tcp_push_pending_frames
>                - 52.88% tcp_write_xmit
>                   - 28.16% tcp_event_new_data_sent
>                      - 28.15% sk_reset_timer
>                         + mod_timer
>                   - 24.68% tcp_schedule_loss_probe
>                      - 24.68% sk_reset_timer
>                         + 24.68% mod_timer
>
> it turns out to be a serious timer migration issue. The timer_start trace
> events collected for tcp_write_timer show that more than 77% of the time this
> timer got migrated to a different CPU:
>
>         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
>         1303826
>         $ wc -l tcp_timer_trace.txt
>         1681068 tcp_timer_trace.txt
>         $ python
>         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
>         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
>         Type "help", "copyright", "credits" or "license" for more information.
>         >>> 1303826 / 1681068.0
>         0.7755938486723916
>
> And all of those migrations happened while an idle CPU was serving a network
> RX softirq.  So the idleness test in idle_cpu() yields a false positive here.
> I don't know whether we should relax it for this scenario in particular, with
> something like:
>
> -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
>                 return cpu;
>
> (There could be a better way than in_serving_softirq() to measure idleness,
> of course.)
>
> Or simply make the TCP retransmit timer pinned. At least this approach
> has minimal impact.
>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Eric Dumazet <edumazet@...gle.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
> ---
>  net/ipv4/inet_connection_sock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index eb30fc1770de..de5510ddb1c8 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
>         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
>         timer_setup(&sk->sk_timer, keepalive_handler, 0);
>         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> --
> 2.21.0
>

Now you are talking ...

We have disabled /proc/sys/kernel/timer_migration on all Google servers,
because it really made no sense on servers, and not only for TCP timers.
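
For anyone who wants to check a host, here is a minimal sketch that reads the
sysctl file named above. The helper name timer_migration_enabled is
hypothetical, and it assumes the usual convention that a non-zero value means
timer migration is enabled.

```python
from pathlib import Path

# Path taken from the message above; the helper name below is hypothetical.
TIMER_MIGRATION = Path("/proc/sys/kernel/timer_migration")


def timer_migration_enabled(path=TIMER_MIGRATION):
    """Return True if the sysctl holds a non-zero value, None if the file is absent."""
    try:
        return int(Path(path).read_text().strip()) != 0
    except FileNotFoundError:
        # Not Linux, or the kernel does not expose this sysctl.
        return None


if __name__ == "__main__":
    print("timer_migration enabled:", timer_migration_enabled())
```

Turning it off means writing 0 to that file as root, for instance with
sysctl -w kernel.timer_migration=0.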

This was a hot topic years ago (random example:
https://lore.kernel.org/patchwork/patch/947052/ )
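
The migration count quoted above (Perl one-liner plus wc -l) can be reproduced
in one pass. This is only a sketch: it reuses the same regular expression, the
function name migration_ratio is hypothetical, and the sample lines are
synthetic, merely shaped so that the pattern matches them. They are not real
trace output.

```python
import re

# Same pattern as the quoted Perl one-liner: the CPU that logged the event
# (in brackets) and the CPU the timer was queued on (cpu=N).
PATTERN = re.compile(r"\[(\d+)\].* cpu=(\d+)")


def migration_ratio(lines):
    """Return (migrated, total, ratio) for an iterable of trace lines."""
    migrated = total = 0
    for line in lines:
        total += 1  # like wc -l, count every line, matched or not
        m = PATTERN.search(line)
        if m and int(m.group(1)) != int(m.group(2)):
            migrated += 1
    ratio = migrated / total if total else 0.0
    return migrated, total, ratio


# Synthetic lines for illustration only; not real timer_start trace data.
SAMPLE = [
    "<idle>-0 [003] timer_start: function=tcp_write_timer cpu=5",
    "<idle>-0 [002] timer_start: function=tcp_write_timer cpu=2",
    "<idle>-0 [001] timer_start: function=tcp_write_timer cpu=7",
    "<idle>-0 [004] timer_start: function=tcp_write_timer cpu=0",
]

if __name__ == "__main__":
    print(migration_ratio(SAMPLE))
```

To run it on a real capture, pass an open file instead of the sample, e.g.
migration_ratio(open("tcp_timer_trace.txt")).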