netdev - Re: [PATCH net-next] tcp: Improve setsockopt() TCP_USER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 2 Jul 2018 23:00:03 -0400
From:   Neal Cardwell <ncardwell@...gle.com>
To:     jmaxwell37@...il.com
Cc:     David Miller <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>, jmaxwell@...hat.com
Subject: Re: [PATCH net-next] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <jmaxwell37@...il.com> wrote:
>
> Every time the TCP retransmission timer fires. It checks to see if there is a
> timeout before scheduling the next retransmit timer. The retransmit interval
> between each retransmission increases exponentially. The issue is that in order
> for the timeout to occur the retransmit timer needs to fire again. If the user
> timeout check happens after the 9th retransmit for example. It needs to wait for
> the 10th retransmit timer to fire in order to evaluate whether a timeout has
> occurred or not. If the interval is large enough then the timeout will be
> inaccurate.
>
> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>
> 1st retransmit:
>
> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Sun Jul  1 22:25:34 EDT 2018
>
> We can see that last retransmit took ~7 seconds. Which pushed the total
> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>
> Fix this by recalculating the last retransmit interval so that it fires when
> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>
> Test results with the patch is the expected 10 second timeout:
>
> 1st retransmit:
>
> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Mon Jul  2 01:38:09 EDT 2018
>
> Signed-off-by: Jon Maxwell <jmaxwell37@...il.com>
> ---
>  net/ipv4/tcp_timer.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928..94491a481722 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct net *net = sock_net(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
> +       __u32 time_remaining = 0;
>
>         if (tp->fastopen_rsk) {
>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>                 /* Use normal (exponential) backoff */
>                 icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>         }
> +       if (icsk->icsk_user_timeout) {
> +               time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
> +                              (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
> +               if (time_remaining < icsk->icsk_rto)
> +                       icsk->icsk_rto = time_remaining;
> +       }

Thanks, a more precise user timeout sounds nice. A couple thoughts:

(a) The  icsk->icsk_rto is in jiffies, and the time_remaining is in
msecs, so it looks like there is a units mismatch here in the
comparisons and assignment.

(b) It also seems like the time_remaining could be negative, because
(a) the icsk_user_timeout is not involved in the baseline RTO
calculation, so that perhaps the first RTO to fire might be beyond the
icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
timer handler can be delayed beyond the targeted icsk_user_timeout.
But time_remaining is a __u32, and  icsk->icsk_rto is also a __u32, so
it seems like a negative number in time_remaining would usually be
treated as a very large unsigned positive number in this comparison:

+               if (time_remaining < icsk->icsk_rto)

(c) If the user timeout is changed between RTO expirations to push the
user timeout further in the future,  then it seems like this commit
will have side effects that left the icsk->icsk_rto in a weird state
that does not do the expected exponential backoff correctly.

(d) There are also wrapping issues to watch out for, since the
tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
milliseconds, which will wrap every 49 days or so. Seems like the code
is OK in that respect.

(e) It also might be nice to put this logic in a helper, rather than
growing the body of tcp_retransmit_timer().

What about something like (pseudocode):

--

static __u32 tcp_clamp_rto_to_user_timeout(sk):
   rto = icsk->icsk_rto;
   if (!icsk->icsk_user_timeout)
      return rto;
   elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
   user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
   if (elapsed >= user_timeout)
      rto = 1;  /* user timeout has passed; fire ASAP */
   else
      rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
   return rto;

tcp_retransmit_timer():
...
  rto = tcp_clamp_rto_to_user_timeout(sk);
  inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);

--

neal