linux-kernel - Re: [PATCH net-next] tcp: Improve setsockopt() TCP_USER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAGHK07AOPe_uoSweBwGa5a30uiqqgmUFPP_aUOP20Y8o0dCwSg@mail.gmail.com>
Date:   Tue, 3 Jul 2018 14:42:26 +1000
From:   Jonathan Maxwell <jmaxwell37@...il.com>
To:     Neal Cardwell <ncardwell@...gle.com>
Cc:     David Miller <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Alexey Kuznetsov <kuznet@....inr.ac.ru>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        Netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Jon Maxwell <jmaxwell@...hat.com>
Subject: Re: [PATCH net-next] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

On Tue, Jul 3, 2018 at 1:15 PM, Jonathan Maxwell <jmaxwell37@...il.com> wrote:
> On Tue, Jul 3, 2018 at 1:00 PM, Neal Cardwell <ncardwell@...gle.com> wrote:
>> On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <jmaxwell37@...il.com> wrote:
>>>
>>> Every time the TCP retransmission timer fires. It checks to see if there is a
>>> timeout before scheduling the next retransmit timer. The retransmit interval
>>> between each retransmission increases exponentially. The issue is that in order
>>> for the timeout to occur the retransmit timer needs to fire again. If the user
>>> timeout check happens after the 9th retransmit for example. It needs to wait for
>>> the 10th retransmit timer to fire in order to evaluate whether a timeout has
>>> occurred or not. If the interval is large enough then the timeout will be
>>> inaccurate.
>>>
>>> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>>>
>>> 1st retransmit:
>>>
>>> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Last retransmit:
>>>
>>> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Timeout:
>>>
>>> send: Connection timed out
>>> Sun Jul  1 22:25:34 EDT 2018
>>>
>>> We can see that last retransmit took ~7 seconds. Which pushed the total
>>> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
>>> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>>>
>>> Fix this by recalculating the last retransmit interval so that it fires when
>>> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>>>
>>> Test results with the patch is the expected 10 second timeout:
>>>
>>> 1st retransmit:
>>>
>>> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Last retransmit:
>>>
>>> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Timeout:
>>>
>>> send: Connection timed out
>>> Mon Jul  2 01:38:09 EDT 2018
>>>
>>> Signed-off-by: Jon Maxwell <jmaxwell37@...il.com>
>>> ---
>>>  net/ipv4/tcp_timer.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> index 3b3611729928..94491a481722 100644
>>> --- a/net/ipv4/tcp_timer.c
>>> +++ b/net/ipv4/tcp_timer.c
>>> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>>         struct tcp_sock *tp = tcp_sk(sk);
>>>         struct net *net = sock_net(sk);
>>>         struct inet_connection_sock *icsk = inet_csk(sk);
>>> +       __u32 time_remaining = 0;
>>>
>>>         if (tp->fastopen_rsk) {
>>>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>>> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>>>                 /* Use normal (exponential) backoff */
>>>                 icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>>         }
>>> +       if (icsk->icsk_user_timeout) {
>>> +               time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
>>> +                              (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
>>> +               if (time_remaining < icsk->icsk_rto)
>>> +                       icsk->icsk_rto = time_remaining;
>>> +       }
>>
>> Thanks, a more precise user timeout sounds nice. A couple thoughts:
>>
>> (a) The  icsk->icsk_rto is in jiffies, and the time_remaining is in
>> msecs, so it looks like there is a units mismatch here in the
>> comparisons and assignment.
>>
>> (b) It also seems like the time_remaining could be negative, because
>> (a) the icsk_user_timeout is not involved in the baseline RTO
>> calculation, so that perhaps the first RTO to fire might be beyond the
>> icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
>> timer handler can be delayed beyond the targeted icsk_user_timeout.
>> But time_remaining is a __u32, and  icsk->icsk_rto is also a __u32, so
>> it seems like a negative number in time_remaining would usually be
>> treated as a very large unsigned positive number in this comparison:
>>
>> +               if (time_remaining < icsk->icsk_rto)
>>
>> (c) If the user timeout is changed between RTO expirations to push the
>> user timeout further in the future,  then it seems like this commit
>> will have side effects that left the icsk->icsk_rto in a weird state
>> that does not do the expected exponential backoff correctly.
>>
>> (d) There are also wrapping issues to watch out for, since the
>> tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
>> milliseconds, which will wrap every 49 days or so. Seems like the code
>> is OK in that respect.
>>
>> (e) It also might be nice to put this logic in a helper, rather than
>> growing the body of tcp_retransmit_timer().
>>
>> What about something like (pseudocode):
>>
>> --
>>
>> static __u32 tcp_clamp_rto_to_user_timeout(sk):
>>    rto = icsk->icsk_rto;
>>    if (!icsk->icsk_user_timeout)
>>       return rto;
>>    elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
>>    user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
>>    if (elapsed >= user_timeout)
>>       rto = 1;  /* user timeout has passed; fire ASAP */
>>    else
>>       rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
>>    return rto;
>>
>> tcp_retransmit_timer():
>> ...
>>   rto = tcp_clamp_rto_to_user_timeout(sk);
>>   inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
>>
>
> Thanks Neal, that looks like a good idea. Let me test that out in my reproducer.
>
> Regards
>
> Jon
>

Thanks for your input and suggestions Neal.

Results were positive in the reproducer. I'll tidy the patch up a bit and submit
as v1 with your ideas.

>> --
>>
>> neal