netdev - Re: [PATCH] tcp: remove bad timeout logic in fast recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAK6E8=canUADQE_sQkhHTeWbRHgLd+56TM2nwaXRDM5dF_f1Bg@mail.gmail.com>
Date:	Fri, 17 May 2013 16:43:53 -0700
From:	Yuchung Cheng <ycheng@...gle.com>
To:	David Miller <davem@...emloft.net>,
	Neal Cardwell <ncardwell@...gle.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Nandita Dukkipati <nanditad@...gle.com>
Cc:	Ilpo Jarvinen <ilpo.jarvinen@...helsinki.fi>,
	netdev <netdev@...r.kernel.org>,
	Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH] tcp: remove bad timeout logic in fast recovery

oops I put some unnecessary header on the commit message. Please
ignore this patch. I will send another one.

On Fri, May 17, 2013 at 4:36 PM, Yuchung Cheng <ycheng@...gle.com> wrote:
> tcp_timeout_skb() was intended to trigger fast recovery on timeout,
> unfortunately in reality it often causes spurious retransmission
> storms during fast recovery. The particular sign is fast retransmit
> over highest sacked sequence (SND.FACK).
>
> Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
> to avoid spurious timeout: when SND.UNA advances the sender re-arms
> RTO and extends the timeout by icsk_rto. The sender does not offset
> the time elapsed since the packet at SND.UNA was sent.
>
> But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
> tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet
> sent before icsk_rto interval lost, including one that's above the
> highest sacked sequence. Most likely a large part of scorebard will
> be marked.
>
> If most packets are not lost then the subsequence DUPACKs with new
> SACK blockes will cause the sender to continue retransmit packets
> beyond SND.FACK spuriously right. Even only one packet is lost the
> sender may falsely retransmit almost the entire window.
>
> The situation becomes common in the world of bufferbloat: the RTT
> continues to grow as the queue builds up but RTTVAR remains small and
> close to the minimum 200ms. If a data packet is lost and the DUPACK
> triggered by the next data packet is slightly delayed, then a spurious
> retransmission storm forms.
>
> As the original comment on tcp_timeout_skb() suggests: the usefulness
> of this feature is questionable. It also wastes cycles walking the
> sack scoreboard and is actually harmful because of the false recovery.
> It's time to remove this.
>
> Change-Id: I8eb2ae80e032dbc07a4d87bf5a68771856a8c04c
> Signed-off-by: Yuchung Cheng <ycheng@...gle.com>
> ---
>  include/linux/tcp.h  |  1 -
>  include/net/tcp.h    |  1 -
>  net/ipv4/tcp_input.c | 65 +---------------------------------------------------
>  3 files changed, 1 insertion(+), 66 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 5adbc33..472120b 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -246,7 +246,6 @@ struct tcp_sock {
>
>         /* from STCP, retrans queue hinting */
>         struct sk_buff* lost_skb_hint;
> -       struct sk_buff *scoreboard_skb_hint;
>         struct sk_buff *retransmit_skb_hint;
>
>         struct sk_buff_head     out_of_order_queue; /* Out of order segments go here */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 5bba80f..e1c3723 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1193,7 +1193,6 @@ static inline void tcp_mib_init(struct net *net)
>  static inline void tcp_clear_retrans_hints_partial(struct tcp_sock *tp)
>  {
>         tp->lost_skb_hint = NULL;
> -       tp->scoreboard_skb_hint = NULL;
>  }
>
>  static inline void tcp_clear_all_retrans_hints(struct tcp_sock *tp)
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index b358e8c..d7d3694 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1255,8 +1255,6 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
>
>         if (skb == tp->retransmit_skb_hint)
>                 tp->retransmit_skb_hint = prev;
> -       if (skb == tp->scoreboard_skb_hint)
> -               tp->scoreboard_skb_hint = prev;
>         if (skb == tp->lost_skb_hint) {
>                 tp->lost_skb_hint = prev;
>                 tp->lost_cnt_hint -= tcp_skb_pcount(prev);
> @@ -1964,20 +1962,6 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
>         return true;
>  }
>
> -static inline int tcp_skb_timedout(const struct sock *sk,
> -                                  const struct sk_buff *skb)
> -{
> -       return tcp_time_stamp - TCP_SKB_CB(skb)->when > inet_csk(sk)->icsk_rto;
> -}
> -
> -static inline int tcp_head_timedout(const struct sock *sk)
> -{
> -       const struct tcp_sock *tp = tcp_sk(sk);
> -
> -       return tp->packets_out &&
> -              tcp_skb_timedout(sk, tcp_write_queue_head(sk));
> -}
> -
>  /* Linux NewReno/SACK/FACK/ECN state machine.
>   * --------------------------------------
>   *
> @@ -2084,12 +2068,6 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
>         if (tcp_dupack_heuristics(tp) > tp->reordering)
>                 return true;
>
> -       /* Trick#3 : when we use RFC2988 timer restart, fast
> -        * retransmit can be triggered by timeout of queue head.
> -        */
> -       if (tcp_is_fack(tp) && tcp_head_timedout(sk))
> -               return true;
> -
>         /* Trick#4: It is still not OK... But will it be useful to delay
>          * recovery more?
>          */
> @@ -2126,44 +2104,6 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
>         return false;
>  }
>
> -/* New heuristics: it is possible only after we switched to restart timer
> - * each time when something is ACKed. Hence, we can detect timed out packets
> - * during fast retransmit without falling to slow start.
> - *
> - * Usefulness of this as is very questionable, since we should know which of
> - * the segments is the next to timeout which is relatively expensive to find
> - * in general case unless we add some data structure just for that. The
> - * current approach certainly won't find the right one too often and when it
> - * finally does find _something_ it usually marks large part of the window
> - * right away (because a retransmission with a larger timestamp blocks the
> - * loop from advancing). -ij
> - */
> -static void tcp_timeout_skbs(struct sock *sk)
> -{
> -       struct tcp_sock *tp = tcp_sk(sk);
> -       struct sk_buff *skb;
> -
> -       if (!tcp_is_fack(tp) || !tcp_head_timedout(sk))
> -               return;
> -
> -       skb = tp->scoreboard_skb_hint;
> -       if (tp->scoreboard_skb_hint == NULL)
> -               skb = tcp_write_queue_head(sk);
> -
> -       tcp_for_write_queue_from(skb, sk) {
> -               if (skb == tcp_send_head(sk))
> -                       break;
> -               if (!tcp_skb_timedout(sk, skb))
> -                       break;
> -
> -               tcp_skb_mark_lost(tp, skb);
> -       }
> -
> -       tp->scoreboard_skb_hint = skb;
> -
> -       tcp_verify_left_out(tp);
> -}
> -
>  /* Detect loss in event "A" above by marking head of queue up as lost.
>   * For FACK or non-SACK(Reno) senders, the first "packets" number of segments
>   * are considered lost. For RFC3517 SACK, a segment is considered lost if it
> @@ -2249,8 +2189,6 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
>                 else if (fast_rexmit)
>                         tcp_mark_head_lost(sk, 1, 1);
>         }
> -
> -       tcp_timeout_skbs(sk);
>  }
>
>  /* CWND moderation, preventing bursts due to too big ACKs
> @@ -2842,7 +2780,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>                 fast_rexmit = 1;
>         }
>
> -       if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
> +       if (do_lost)
>                 tcp_update_scoreboard(sk, fast_rexmit);
>         tcp_cwnd_reduction(sk, newly_acked_sacked, fast_rexmit);
>         tcp_xmit_retransmit_queue(sk);
> @@ -3075,7 +3013,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
>
>                 tcp_unlink_write_queue(skb, sk);
>                 sk_wmem_free_skb(sk, skb);
> -               tp->scoreboard_skb_hint = NULL;
>                 if (skb == tp->retransmit_skb_hint)
>                         tp->retransmit_skb_hint = NULL;
>                 if (skb == tp->lost_skb_hint)
> --
> 1.8.2.1
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html