netdev - Re: [PATCH] tcp: use rto_min value from socket in retransmits timeout

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <E09A2DA0-A741-4566-B8C6-09C563546538@yandex-team.ru>
Date:   Fri, 30 Jul 2021 15:37:03 +0300
From:   Dmitry Yakunin <zeil@...dex-team.ru>
To:     Neal Cardwell <ncardwell@...gle.com>
Cc:     kafai@...com, edumazet@...gle.com, netdev@...r.kernel.org,
        bpf@...r.kernel.org, dmtrmonakhov@...dex-team.ru,
        Yuchung Cheng <ycheng@...gle.com>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        mitradir@...dex-team.ru
Subject: Re: [PATCH] tcp: use rto_min value from socket in retransmits timeout

Hello, Neal!

Thanks for your reply and explanations.

I agree with all your points, about safe defaults for both timeouts and the number of retries. But what the patch does is not changing the defaults, it only provides a way to work with these values through bpf, which is important in an environment that is way different from cellular networks. For example in the modern DC the rto_min value should correspond with real RTT, that definitely not 200ms.

Also I add Alexander Azimov for further discussions.

--
Dmitry

> On 23 Jul 2021, at 17:41, Neal Cardwell <ncardwell@...gle.com> wrote:
> 
> .(On Fri, Jul 23, 2021 at 5:41 AM Dmitry Yakunin <zeil@...dex-team.ru> wrote:
>> 
>> Commit ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
>> adds ability to set rto_min value on socket less then default TCP_RTO_MIN.
>> But retransmits_timed_out() function still uses TCP_RTO_MIN and
>> tcp_retries{1,2} sysctls don't work properly for tuned socket values.
>> 
>> Fixes: ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
>> Signed-off-by: Dmitry Yakunin <zeil@...dex-team.ru>
>> Acked-by: Dmitry Monakhov <dmtrmonakhov@...dex-team.ru>
>> ---
>> net/ipv4/tcp_timer.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>> 
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 20cf4a9..66c4b97 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -199,12 +199,13 @@ static unsigned int tcp_model_timeout(struct sock *sk,
>>  *  @boundary: max number of retransmissions
>>  *  @timeout:  A custom timeout value.
>>  *             If set to 0 the default timeout is calculated and used.
>> - *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
>> + *             Using icsk_rto_min value from socket or RTAX_RTO_MIN from route
>> + *             and the number of unsuccessful retransmits.
>>  *
>>  * The default "timeout" value this function can calculate and use
>>  * is equivalent to the timeout of a TCP Connection
>>  * after "boundary" unsuccessful, exponentially backed-off
>> - * retransmissions with an initial RTO of TCP_RTO_MIN.
>> + * retransmissions with an initial RTO of icsk_rto_min or RTAX_RTO_MIN.
>>  */
>> static bool retransmits_timed_out(struct sock *sk,
>>                                  unsigned int boundary,
>> @@ -217,7 +218,7 @@ static bool retransmits_timed_out(struct sock *sk,
>> 
>>        start_ts = tcp_sk(sk)->retrans_stamp;
>>        if (likely(timeout == 0)) {
>> -               unsigned int rto_base = TCP_RTO_MIN;
>> +               unsigned int rto_base = tcp_rto_min(sk);
>> 
>>                if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
>>                        rto_base = tcp_timeout_init(sk);
>> --
> 
> I would argue strenuously against this. We tried the approach in this
> patch at Google years ago, but we had to revert that approach and go
> back to using TCP_RTO_MIN as the baseline for the computation, because
> using the custom tcp_rto_min(sk) caused serious reliability problems.
> 
> The behavior in this patch causes various serious reliability problems
> because the retransmits_timed_out() computation is used for various
> timeout decisions that determine how long a connection tries to
> retransmit something before deciding the path is bad and/or giving up
> and closing the connection. Here are a few of the problems this
> causes:
> 
> (1) The biggest one is probably orphan retries. By default
> tcp_orphan_retries() uses a retry count of 8. But if your min_rto is
> 5ms (used at Google for many years), then the 8 retries means an
> orphaned connection (whose fd is no longer held by a process, but is
> still established) only lasts for 1.275 seconds before giving up and
> closing. This means that connectivity problems longer than 1.275
> seconds (extremely common with cellular links) are not tolerated for
> such connections; the connections often do not receive the data they
> were supposed to receive.
> 
> (2) TCP_RETR1 /sysctl_tcp_retries1, used for __dst_negative_advice(),
> also has big problems. Even with a min_rto as big as 20ms, on a route
> with 150ms RTT, the approach in this patch will cause
> retransmits_timed_out() to return true upon the 1st RTO timer firing,
> even though TCP_RETR1 is 3.
> 
> (3) TCP_RETR2 /sysctl_tcp_retries2, with a default of 15, used for
> regular connection retry lifetimes, also has a massive decrease in
> robustness, due to falling from  109 minutes with a 200ms RTO, to
> about 2.7 minutes with a min_rto of 5ms.
> 
> neal