[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CANn89iJ4b83GKa+LnGUsaTvBwK+eGwaetktOwXfoR9J5b9JbzQ@mail.gmail.com>
Date: Tue, 7 Nov 2023 21:26:41 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: David Morley <morleyd.kernel@...il.com>
Cc: David Miller <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
David Morley <morleyd@...gle.com>, Yuchung Cheng <ycheng@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH net-next] tcp: make the first N SYN RTO backoffs linear
On Tue, May 9, 2023 at 8:06 PM David Morley <morleyd.kernel@...il.com> wrote:
>
> From: David Morley <morleyd@...gle.com>
>
> Currently the SYN RTO schedule follows an exponential backoff
> scheme, which can be unnecessarily conservative in cases where
> there are link failures. In such cases, it's better to
> aggressively try to retransmit packets, so it takes routers
> less time to find a repath with a working link.
>
> We chose a default value for this sysctl of 4, to follow
> the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
> MacOS and IOS have used this backoff schedule for over
> a decade, since before this 2009 IETF presentation
> discussed the behavior:
> https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
>
> This commit makes the SYN RTO schedule start with a number of
> linear backoffs given by the following sysctl:
> * tcp_syn_linear_timeouts
>
> This changes the SYN RTO scheme to be: init_rto_val for
> tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
>
> For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
> backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
>
> Signed-off-by: David Morley <morleyd@...gle.com>
> Signed-off-by: Yuchung Cheng <ycheng@...gle.com>
> Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
> Tested-by: David Morley <morleyd@...gle.com>
> Reviewed-by: Eric Dumazet <edumazet@...gle.com>
> ---
> Documentation/networking/ip-sysctl.rst | 17 ++++++++++++++---
> include/net/netns/ipv4.h | 1 +
> net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
> net/ipv4/tcp_ipv4.c | 1 +
> net/ipv4/tcp_timer.c | 17 +++++++++++++----
> 5 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6ec06a33688a..3f6d3d5f5626 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
> tcp_syn_retries - INTEGER
> Number of times initial SYNs for an active TCP connection attempt
> will be retransmitted. Should not be higher than 127. Default value
> - is 6, which corresponds to 63seconds till the last retransmission
> - with the current initial RTO of 1second. With this the final timeout
> - for an active TCP connection attempt will happen after 127seconds.
> + is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
> + till the last retransmission with the current initial RTO of 1second.
> + With this the final timeout for an active TCP connection attempt
> + will happen after 131seconds.
>
> tcp_timestamps - INTEGER
> Enable timestamps as defined in RFC1323.
> @@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
>
> Default: 120
>
> +tcp_syn_linear_timeouts - INTEGER
> + The number of times for an active TCP connection to retransmit SYNs with
> + a linear backoff timeout before defaulting to an exponential backoff
> + timeout. This has no effect on SYNACK at the passive TCP side.
> +
> + With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
> + expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
> + and the first exponential backoff using 2^0 * initial_RTO).
> + Default: 4
> +
> tcp_tso_win_divisor - INTEGER
> This allows control over what percentage of the congestion window
> can be consumed by a single TSO frame.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index db762e35aca9..a4efb7a2796c 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -194,6 +194,7 @@ struct netns_ipv4 {
> int sysctl_udp_rmem_min;
>
> u8 sysctl_fib_notify_on_flag_change;
> + u8 sysctl_tcp_syn_linear_timeouts;
>
> #ifdef CONFIG_NET_L3_MASTER_DEV
> u8 sysctl_udp_l3mdev_accept;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 40fe70fc2015..6ae3345a3bdf 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -34,6 +34,7 @@ static int ip_ttl_min = 1;
> static int ip_ttl_max = 255;
> static int tcp_syn_retries_min = 1;
> static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
> +static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
> static int ip_ping_group_range_min[] = { 0, 0 };
> static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
> static u32 u32_max_div_HZ = UINT_MAX / HZ;
> @@ -1470,6 +1471,15 @@ static struct ctl_table ipv4_net_table[] = {
> .extra1 = SYSCTL_ZERO,
> .extra2 = &tcp_plb_max_cong_thresh,
> },
> + {
> + .procname = "tcp_syn_linear_timeouts",
> + .data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts,
> + .maxlen = sizeof(u8),
> + .mode = 0644,
> + .proc_handler = proc_dou8vec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = &tcp_syn_linear_timeouts_max,
> + },
> { }
> };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 39bda2b1066e..db24ed8f8509 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3275,6 +3275,7 @@ static int __net_init tcp_sk_init(struct net *net)
> else
> net->ipv4.tcp_congestion_control = &tcp_reno;
>
> + net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
> return 0;
> }
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b839c2f91292..0d93a2573807 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -234,14 +234,19 @@ static int tcp_write_timeout(struct sock *sk)
> struct tcp_sock *tp = tcp_sk(sk);
> struct net *net = sock_net(sk);
> bool expired = false, do_reset;
> - int retry_until;
> + int retry_until, max_retransmits;
>
> if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> if (icsk->icsk_retransmits)
> __dst_negative_advice(sk);
> retry_until = icsk->icsk_syn_retries ? :
> READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
> - expired = icsk->icsk_retransmits >= retry_until;
> +
> + max_retransmits = retry_until;
> + if (sk->sk_state == TCP_SYN_SENT)
> + max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);
> +
> + expired = icsk->icsk_retransmits >= max_retransmits;
> } else {
> if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
> /* Black hole detection */
> @@ -577,8 +582,12 @@ void tcp_retransmit_timer(struct sock *sk)
> icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
> icsk->icsk_backoff = 0;
> icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
> - } else {
> - /* Use normal (exponential) backoff */
> + } else if (sk->sk_state != TCP_SYN_SENT ||
> + icsk->icsk_backoff >
> + READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
> + /* Use normal (exponential) backoff unless linear timeouts are
> + * activated.
> + */
Hi David, back to this patch.
If sysctl_tcp_syn_linear_timeouts is set to a high value, we could end up with
icsk->icsk_backoff > 64
This can generate various overflows later, eg from inet_csk_rto_backoff(),
called from tcp_ld_RTO_revert()
More generally tcp_ld_RTO_revert() is not aware of linear timeouts.
Thank you.
Powered by blists - more mailing lists