[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <E09A2DA0-A741-4566-B8C6-09C563546538@yandex-team.ru>
Date: Fri, 30 Jul 2021 15:37:03 +0300
From: Dmitry Yakunin <zeil@...dex-team.ru>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: kafai@...com, edumazet@...gle.com, netdev@...r.kernel.org,
bpf@...r.kernel.org, dmtrmonakhov@...dex-team.ru,
Yuchung Cheng <ycheng@...gle.com>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
mitradir@...dex-team.ru
Subject: Re: [PATCH] tcp: use rto_min value from socket in retransmits timeout
Hello, Neal!
Thanks for your reply and explanations.
I agree with all your points, about safe defaults for both timeouts and the number of retries. But what the patch does is not changing the defaults, it only provides a way to work with these values through bpf, which is important in an environment that is way different from cellular networks. For example in the modern DC the rto_min value should correspond with real RTT, that definitely not 200ms.
Also I add Alexander Azimov for further discussions.
--
Dmitry
> On 23 Jul 2021, at 17:41, Neal Cardwell <ncardwell@...gle.com> wrote:
>
> .(On Fri, Jul 23, 2021 at 5:41 AM Dmitry Yakunin <zeil@...dex-team.ru> wrote:
>>
>> Commit ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
>> adds ability to set rto_min value on socket less then default TCP_RTO_MIN.
>> But retransmits_timed_out() function still uses TCP_RTO_MIN and
>> tcp_retries{1,2} sysctls don't work properly for tuned socket values.
>>
>> Fixes: ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
>> Signed-off-by: Dmitry Yakunin <zeil@...dex-team.ru>
>> Acked-by: Dmitry Monakhov <dmtrmonakhov@...dex-team.ru>
>> ---
>> net/ipv4/tcp_timer.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 20cf4a9..66c4b97 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -199,12 +199,13 @@ static unsigned int tcp_model_timeout(struct sock *sk,
>> * @boundary: max number of retransmissions
>> * @timeout: A custom timeout value.
>> * If set to 0 the default timeout is calculated and used.
>> - * Using TCP_RTO_MIN and the number of unsuccessful retransmits.
>> + * Using icsk_rto_min value from socket or RTAX_RTO_MIN from route
>> + * and the number of unsuccessful retransmits.
>> *
>> * The default "timeout" value this function can calculate and use
>> * is equivalent to the timeout of a TCP Connection
>> * after "boundary" unsuccessful, exponentially backed-off
>> - * retransmissions with an initial RTO of TCP_RTO_MIN.
>> + * retransmissions with an initial RTO of icsk_rto_min or RTAX_RTO_MIN.
>> */
>> static bool retransmits_timed_out(struct sock *sk,
>> unsigned int boundary,
>> @@ -217,7 +218,7 @@ static bool retransmits_timed_out(struct sock *sk,
>>
>> start_ts = tcp_sk(sk)->retrans_stamp;
>> if (likely(timeout == 0)) {
>> - unsigned int rto_base = TCP_RTO_MIN;
>> + unsigned int rto_base = tcp_rto_min(sk);
>>
>> if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
>> rto_base = tcp_timeout_init(sk);
>> --
>
> I would argue strenuously against this. We tried the approach in this
> patch at Google years ago, but we had to revert that approach and go
> back to using TCP_RTO_MIN as the baseline for the computation, because
> using the custom tcp_rto_min(sk) caused serious reliability problems.
>
> The behavior in this patch causes various serious reliability problems
> because the retransmits_timed_out() computation is used for various
> timeout decisions that determine how long a connection tries to
> retransmit something before deciding the path is bad and/or giving up
> and closing the connection. Here are a few of the problems this
> causes:
>
> (1) The biggest one is probably orphan retries. By default
> tcp_orphan_retries() uses a retry count of 8. But if your min_rto is
> 5ms (used at Google for many years), then the 8 retries means an
> orphaned connection (whose fd is no longer held by a process, but is
> still established) only lasts for 1.275 seconds before giving up and
> closing. This means that connectivity problems longer than 1.275
> seconds (extremely common with cellular links) are not tolerated for
> such connections; the connections often do not receive the data they
> were supposed to receive.
>
> (2) TCP_RETR1 /sysctl_tcp_retries1, used for __dst_negative_advice(),
> also has big problems. Even with a min_rto as big as 20ms, on a route
> with 150ms RTT, the approach in this patch will cause
> retransmits_timed_out() to return true upon the 1st RTO timer firing,
> even though TCP_RETR1 is 3.
>
> (3) TCP_RETR2 /sysctl_tcp_retries2, with a default of 15, used for
> regular connection retry lifetimes, also has a massive decrease in
> robustness, due to falling from 109 minutes with a 200ms RTO, to
> about 2.7 minutes with a min_rto of 5ms.
>
> neal
Powered by blists - more mailing lists