Message-ID: <CANn89iKS6fas9O74U5w1wb+8DN==fXRKQ8nzq0tkT_VOXRtYBQ@mail.gmail.com>
Date: Fri, 1 Nov 2019 15:30:56 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev <netdev@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned
On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>
> While investigating the spinlock contention on resetting TCP
> retransmit timer:
>
> 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
> ...
> - 58.83% tcp_v4_rcv
> - 58.80% tcp_v4_do_rcv
> - 58.80% tcp_rcv_established
> - 52.88% __tcp_push_pending_frames
> - 52.88% tcp_write_xmit
> - 28.16% tcp_event_new_data_sent
> - 28.15% sk_reset_timer
> + mod_timer
> - 24.68% tcp_schedule_loss_probe
> - 24.68% sk_reset_timer
> + 24.68% mod_timer
>
> it turns out to be a serious timer migration issue. After collecting timer_start
> trace events for tcp_write_timer, it shows that in more than 77% of cases this
> timer got migrated to a different CPU:
>
> $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
> 1303826
> $ wc -l tcp_timer_trace.txt
> 1681068 tcp_timer_trace.txt
> $ python
> Python 2.7.5 (default, Jul 11 2019, 17:13:53)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 1303826 / 1681068.0
> 0.7755938486723916
>
> All of those migrations happened while an idle CPU was serving a network RX
> softirq, so the idleness test in idle_cpu() yields a false positive here.
> I don't know whether we should relax it for this particular scenario, something
> like:
>
> - if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> + if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> + housekeeping_cpu(cpu, HK_FLAG_TIMER))
> return cpu;
>
> (There could be a better way than in_serving_softirq() to measure idleness,
> of course.)
>
> Or simply make the TCP retransmit timer pinned. At least this approach
> has minimal impact.
>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Eric Dumazet <edumazet@...gle.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
> ---
> net/ipv4/inet_connection_sock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index eb30fc1770de..de5510ddb1c8 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
> {
> struct inet_connection_sock *icsk = inet_csk(sk);
>
> - timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> + timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
> timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
> timer_setup(&sk->sk_timer, keepalive_handler, 0);
> icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> --
> 2.21.0
>
Now you are talking ...
We have disabled /proc/sys/kernel/timer_migration on all Google servers,
because it made no sense on servers, and not only for TCP timers.
This was a hot topic years ago (random example:
https://lore.kernel.org/patchwork/patch/947052/ )
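
For reference, the migration ratio computed in the quoted message with perl, wc and
python can be sketched in a single script. This is only a rough sketch: the function
name migration_ratio and the sample trace lines are hypothetical, and the regular
expression is the one from the quoted perl one-liner. It assumes each timer_start
line carries the recording CPU in brackets and the target CPU in a cpu= field.

```python
import re

# Same pattern as the quoted perl one-liner: the CPU that recorded the
# event appears in brackets, the CPU the timer was queued on as cpu=N.
PATTERN = re.compile(r"\[(\d+)\].* cpu=(\d+)")


def migration_ratio(lines):
    """Return (migrated, total, ratio) for an iterable of trace lines.

    total counts every line, mirroring `wc -l` in the quoted session;
    migrated counts lines whose two CPU numbers differ.
    """
    migrated = 0
    total = 0
    for line in lines:
        total += 1
        match = PATTERN.search(line)
        # Compare numerically, as perl's != does ("003" vs "3").
        if match and int(match.group(1)) != int(match.group(2)):
            migrated += 1
    ratio = migrated / total if total else 0.0
    return migrated, total, ratio


if __name__ == "__main__":
    # Hypothetical sample lines, shaped like timer_start trace events.
    sample = [
        "<idle>-0 [003] ..s. 1.000001: timer_start: timer=a function=tcp_write_timer cpu=5",
        "<idle>-0 [002] ..s. 1.000002: timer_start: timer=b function=tcp_write_timer cpu=2",
        "<idle>-0 [001] ..s. 1.000003: timer_start: timer=c function=tcp_write_timer cpu=0",
        "<idle>-0 [004] ..s. 1.000004: timer_start: timer=d function=tcp_write_timer cpu=4",
    ]
    print(migration_ratio(sample))
```

To run it against a real capture, pass an open file object, for example
migration_ratio(open("tcp_timer_trace.txt")).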