netdev - Re: [PATCH net-next] tcp: make the first N SYN RTO backoffs linear

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CANn89iJ4b83GKa+LnGUsaTvBwK+eGwaetktOwXfoR9J5b9JbzQ@mail.gmail.com>
Date: Tue, 7 Nov 2023 21:26:41 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: David Morley <morleyd.kernel@...il.com>
Cc: David Miller <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org, 
	David Morley <morleyd@...gle.com>, Yuchung Cheng <ycheng@...gle.com>, 
	Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH net-next] tcp: make the first N SYN RTO backoffs linear

On Tue, May 9, 2023 at 8:06 PM David Morley <morleyd.kernel@...il.com> wrote:
>
> From: David Morley <morleyd@...gle.com>
>
> Currently the SYN RTO schedule follows an exponential backoff
> scheme, which can be unnecessarily conservative in cases where
> there are link failures. In such cases, it's better to
> aggressively try to retransmit packets, so it takes routers
> less time to find a repath with a working link.
>
> We chose a default value for this sysctl of 4, to follow
> the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
> MacOS and IOS have used this backoff schedule for over
> a decade, since before this 2009 IETF presentation
> discussed the behavior:
> https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
>
> This commit makes the SYN RTO schedule start with a number of
> linear backoffs given by the following sysctl:
> * tcp_syn_linear_timeouts
>
> This changes the SYN RTO scheme to be: init_rto_val for
> tcp_syn_linear_timeouts, exp backoff starting at init_rto_val
>
> For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
> backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...
>
> Signed-off-by: David Morley <morleyd@...gle.com>
> Signed-off-by: Yuchung Cheng <ycheng@...gle.com>
> Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
> Tested-by: David Morley <morleyd@...gle.com>
> Reviewed-by: Eric Dumazet <edumazet@...gle.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 17 ++++++++++++++---
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             | 10 ++++++++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_timer.c                   | 17 +++++++++++++----
>  5 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 6ec06a33688a..3f6d3d5f5626 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
>  tcp_syn_retries - INTEGER
>         Number of times initial SYNs for an active TCP connection attempt
>         will be retransmitted. Should not be higher than 127. Default value
> -       is 6, which corresponds to 63seconds till the last retransmission
> -       with the current initial RTO of 1second. With this the final timeout
> -       for an active TCP connection attempt will happen after 127seconds.
> +       is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
> +       till the last retransmission with the current initial RTO of 1second.
> +       With this the final timeout for an active TCP connection attempt
> +       will happen after 131seconds.
>
>  tcp_timestamps - INTEGER
>         Enable timestamps as defined in RFC1323.
> @@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
>
>         Default: 120
>
> +tcp_syn_linear_timeouts - INTEGER
> +       The number of times for an active TCP connection to retransmit SYNs with
> +       a linear backoff timeout before defaulting to an exponential backoff
> +       timeout. This has no effect on SYNACK at the passive TCP side.
> +
> +       With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
> +       expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
> +       and the first exponential backoff using 2^0 * initial_RTO).
> +       Default: 4
> +
>  tcp_tso_win_divisor - INTEGER
>         This allows control over what percentage of the congestion window
>         can be consumed by a single TSO frame.
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index db762e35aca9..a4efb7a2796c 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -194,6 +194,7 @@ struct netns_ipv4 {
>         int sysctl_udp_rmem_min;
>
>         u8 sysctl_fib_notify_on_flag_change;
> +       u8 sysctl_tcp_syn_linear_timeouts;
>
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>         u8 sysctl_udp_l3mdev_accept;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 40fe70fc2015..6ae3345a3bdf 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -34,6 +34,7 @@ static int ip_ttl_min = 1;
>  static int ip_ttl_max = 255;
>  static int tcp_syn_retries_min = 1;
>  static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
> +static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
>  static int ip_ping_group_range_min[] = { 0, 0 };
>  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
>  static u32 u32_max_div_HZ = UINT_MAX / HZ;
> @@ -1470,6 +1471,15 @@ static struct ctl_table ipv4_net_table[] = {
>                 .extra1         = SYSCTL_ZERO,
>                 .extra2         = &tcp_plb_max_cong_thresh,
>         },
> +       {
> +               .procname = "tcp_syn_linear_timeouts",
> +               .data = &init_net.ipv4.sysctl_tcp_syn_linear_timeouts,
> +               .maxlen = sizeof(u8),
> +               .mode = 0644,
> +               .proc_handler = proc_dou8vec_minmax,
> +               .extra1 = SYSCTL_ZERO,
> +               .extra2 = &tcp_syn_linear_timeouts_max,
> +       },
>         { }
>  };
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 39bda2b1066e..db24ed8f8509 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3275,6 +3275,7 @@ static int __net_init tcp_sk_init(struct net *net)
>         else
>                 net->ipv4.tcp_congestion_control = &tcp_reno;
>
> +       net->ipv4.sysctl_tcp_syn_linear_timeouts = 4;
>         return 0;
>  }
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b839c2f91292..0d93a2573807 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -234,14 +234,19 @@ static int tcp_write_timeout(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct net *net = sock_net(sk);
>         bool expired = false, do_reset;
> -       int retry_until;
> +       int retry_until, max_retransmits;
>
>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>                 if (icsk->icsk_retransmits)
>                         __dst_negative_advice(sk);
>                 retry_until = icsk->icsk_syn_retries ? :
>                         READ_ONCE(net->ipv4.sysctl_tcp_syn_retries);
> -               expired = icsk->icsk_retransmits >= retry_until;
> +
> +               max_retransmits = retry_until;
> +               if (sk->sk_state == TCP_SYN_SENT)
> +                       max_retransmits += READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts);
> +
> +               expired = icsk->icsk_retransmits >= max_retransmits;
>         } else {
>                 if (retransmits_timed_out(sk, READ_ONCE(net->ipv4.sysctl_tcp_retries1), 0)) {
>                         /* Black hole detection */
> @@ -577,8 +582,12 @@ void tcp_retransmit_timer(struct sock *sk)
>             icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
>                 icsk->icsk_backoff = 0;
>                 icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
> -       } else {
> -               /* Use normal (exponential) backoff */
> +       } else if (sk->sk_state != TCP_SYN_SENT ||
> +                  icsk->icsk_backoff >
> +                  READ_ONCE(net->ipv4.sysctl_tcp_syn_linear_timeouts)) {
> +               /* Use normal (exponential) backoff unless linear timeouts are
> +                * activated.
> +                */

Hi David, back to this patch.

If sysctl_tcp_syn_linear_timeouts is set to a high value, we could end up with
 icsk->icsk_backoff > 64

This can generate various overflows later, eg from inet_csk_rto_backoff(),
called from tcp_ld_RTO_revert()

More generally tcp_ld_RTO_revert() is not aware of linear timeouts.

Thank you.