netdev - Re: [RFC Patch] tcp: make icsk_retransmit

Message-ID: <CAM_iQpUGAaV9hsP4Z7YoHD6rQuJDSP_WNk_-d97Uxyed2SsgrA@mail.gmail.com>
Date:   Fri, 1 Nov 2019 15:43:19 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     netdev <netdev@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC Patch] tcp: make icsk_retransmit_timer pinned

On Fri, Nov 1, 2019 at 3:31 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Fri, Nov 1, 2019 at 3:16 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
> >
> > While investigating the spinlock contention on resetting TCP
> > retransmit timer:
> >
> >   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> >    ...
> >     - 58.83% tcp_v4_rcv
> >       - 58.80% tcp_v4_do_rcv
> >          - 58.80% tcp_rcv_established
> >             - 52.88% __tcp_push_pending_frames
> >                - 52.88% tcp_write_xmit
> >                   - 28.16% tcp_event_new_data_sent
> >                      - 28.15% sk_reset_timer
> >                         + mod_timer
> >                   - 24.68% tcp_schedule_loss_probe
> >                      - 24.68% sk_reset_timer
> >                         + 24.68% mod_timer
> >
> > it turns out to be a serious timer migration issue. After collecting timer_start
> > trace events for tcp_write_timer, they show that more than 77% of the time this
> > timer got migrated to a different CPU:
> >
> >         $ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
> >         1303826
> >         $ wc -l tcp_timer_trace.txt
> >         1681068 tcp_timer_trace.txt
> >         $ python
> >         Python 2.7.5 (default, Jul 11 2019, 17:13:53)
> >         [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> >         Type "help", "copyright", "credits" or "license" for more information.
> >         >>> 1303826 / 1681068.0
> >         0.7755938486723916
> >
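
For reference, the same count can be reproduced without perl. This is only a
minimal Python 3 sketch: it assumes the trace file uses the usual ftrace text
layout, where the first bracketed number is the CPU that logged the event and
"cpu=" is the CPU the timer was queued on. The function name migration_counts
is made up for illustration.

```python
import re

# Same pattern as the perl one-liner: the first "[N]" is the CPU that logged
# the event, "cpu=N" is assumed to be the CPU the timer was queued on.
PATTERN = re.compile(r"\[(\d+)\].* cpu=(\d+)")


def migration_counts(lines):
    """Return (migrated, total) for an iterable of trace lines.

    total counts every line, like `wc -l`; migrated counts the lines where
    the two CPU numbers differ, like the perl filter.
    """
    migrated = 0
    total = 0
    for line in lines:
        total += 1
        match = PATTERN.search(line)
        # Compare as integers, since ftrace zero-pads the bracketed CPU.
        if match and int(match.group(1)) != int(match.group(2)):
            migrated += 1
    return migrated, total
```

Calling it on an open file object for tcp_timer_trace.txt and dividing the two
numbers should give the same ratio as the shell session.
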
> > And all of those migrations happened while an idle CPU was serving a network RX
> > softirq.  So, the idleness test in idle_cpu() gives a false positive here.
> > I don't know whether we should relax it for this scenario in particular, something
> > like:
> >
> > -       if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
> > +       if ((!idle_cpu(cpu) || in_serving_softirq()) &&
> > +           housekeeping_cpu(cpu, HK_FLAG_TIMER))
> >                 return cpu;
> >
> > (There could be a better way than in_serving_softirq() to measure the idleness,
> > of course.)
> >
> > Or simply make the TCP retransmit timer pinned. At least this approach
> > has minimal impact.
> >
> > Cc: Thomas Gleixner <tglx@...utronix.de>
> > Cc: Eric Dumazet <edumazet@...gle.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
> > ---
> >  net/ipv4/inet_connection_sock.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index eb30fc1770de..de5510ddb1c8 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
> >  {
> >         struct inet_connection_sock *icsk = inet_csk(sk);
> >
> > -       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
> > +       timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
> >         timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
> >         timer_setup(&sk->sk_timer, keepalive_handler, 0);
> >         icsk->icsk_pending = icsk->icsk_ack.pending = 0;
> > --
> > 2.21.0
> >
>
> Now you are talking ...
>
> We have disabled /proc/sys/kernel/timer_migration on all Google servers,
> because this made no sense on servers really, and not only for tcp timers.

So should we disable the timer_migration sysctl by default? It always
comes down to how we want to trade off CPU power saving against latency.

Did you measure how much power consumption increases after disabling it?
If not much, we can certainly make it disabled by default.
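
For anyone who wants to check the current setting on a box before measuring,
here is a minimal Python 3 sketch. It only reads the /proc file mentioned
above; the helper name timer_migration_enabled and the "None when unreadable"
convention are my own assumptions, not an existing interface.

```python
from pathlib import Path


def timer_migration_enabled(path="/proc/sys/kernel/timer_migration"):
    """Return True or False for the sysctl value, or None if it cannot be read.

    The path argument is hypothetical convenience so the helper can be
    pointed at a different file for testing.
    """
    try:
        return int(Path(path).read_text().strip()) != 0
    except (OSError, ValueError):
        # File missing, unreadable, or not a plain integer.
        return None
```

A return value of False would correspond to the setup described above, where
timer migration is turned off.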

>
> This was a hot topic years ago (random example:
> https://lore.kernel.org/patchwork/patch/947052/ )

Yeah, this specific patch was merged a long time ago,
but I know you are not just talking about this single one. :)

Thanks.