Date:	Mon, 11 Mar 2013 18:47:00 -0400
From:	Yuchung Cheng <ycheng@...gle.com>
To:	Nandita Dukkipati <nanditad@...gle.com>
Cc:	"David S. Miller" <davem@...emloft.net>,
	Neal Cardwell <ncardwell@...gle.com>,
	Eric Dumazet <edumazet@...gle.com>,
	netdev <netdev@...r.kernel.org>,
	Ilpo Jarvinen <ilpo.jarvinen@...helsinki.fi>,
	Tom Herbert <therbert@...gle.com>
Subject: Re: [PATCH 1/2] tcp: Tail loss probe (TLP)

On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@...gle.com> wrote:
> This patch series implements the Tail loss probe (TLP) algorithm described
> in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
> first patch implements the basic algorithm.
>
> TLP's goal is to reduce tail latency of short transactions. It achieves
> this by converting retransmission timeouts (RTOs) occurring due
> to tail losses (losses at the end of transactions) into fast recovery.
> TLP transmits one packet in two round-trips when a connection is in
> Open state and isn't receiving any ACKs. The transmitted packet, aka
> loss probe, can be either new or a retransmission. When there is tail
> loss, the ACK from a loss probe triggers FACK/early-retransmit based
> fast recovery, thus avoiding a costly RTO. In the absence of loss,
> there is no change in the connection state.
>
> PTO stands for probe timeout. It is a timer event indicating that an ACK
> is overdue; when it fires, a loss probe packet is transmitted. The PTO
> value is set to max(2*SRTT, 10ms) and is adjusted to account for the
> delayed ACK timer when there is only one outstanding packet (see the
> sketch after the algorithm summary below).
>
> TLP Algorithm
>
> On transmission of new data in Open state:
>   -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
>   -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
>   -> PTO = min(PTO, RTO)
>
> Conditions for scheduling PTO:
>   -> Connection is in Open state.
>   -> Connection is either cwnd-limited or has no new data to send.
>   -> Number of probes per tail loss episode is limited to one.
>   -> Connection is SACK enabled.
>
> When PTO fires:
>   new_segment_exists:
>     -> transmit new segment.
>     -> packets_out++. cwnd remains the same.
>
>   no_new_packet:
>     -> retransmit the last segment.
>        Its ACK triggers FACK or early retransmit based recovery.
>
> ACK path:
>   -> rearm RTO at start of ACK processing.
>   -> reschedule PTO if need be.
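>
> To make the arithmetic concrete, here is a minimal userspace sketch of
> the PTO computation above, in milliseconds rather than jiffies
> (compute_pto_ms and DELACK_MAX_MS are illustrative names, not kernel API):
>
>   #include <stdio.h>
>
>   #define DELACK_MAX_MS 200 /* worst-case delayed ACK, as above */
>
>   static unsigned int max_u(unsigned int a, unsigned int b)
>   {
>           return a > b ? a : b;
>   }
>
>   /* PTO per the rules above, clamped so it never exceeds the RTO. */
>   static unsigned int compute_pto_ms(unsigned int srtt_ms,
>                                      unsigned int rto_ms,
>                                      unsigned int packets_out)
>   {
>           unsigned int pto;
>
>           if (packets_out > 1)
>                   pto = max_u(2 * srtt_ms, 10);
>           else    /* one packet out: allow for a delayed ACK */
>                   pto = max_u(2 * srtt_ms,
>                               srtt_ms + srtt_ms / 2 + DELACK_MAX_MS);
>           return pto < rto_ms ? pto : rto_ms; /* PTO = min(PTO, RTO) */
>   }
>
>   int main(void)
>   {
>           /* e.g. SRTT = 40ms, RTO = 220ms, 3 packets out -> PTO = 80ms */
>           printf("PTO = %u ms\n", compute_pto_ms(40, 220, 3));
>           return 0;
>   }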
>
> In addition, the patch includes a small variation of the Early Retransmit
> (ER) algorithm, such that ER and TLP together can in principle recover a
> tail loss of any degree N through fast recovery. TLP is controlled by the
> same sysctl as ER, tcp_early_retrans:
> tcp_early_retrans == 0: disables TLP and ER.
>                   == 1: enables RFC5827 ER.
>                   == 2: delayed ER.
>                   == 3: TLP and delayed ER. [DEFAULT]
>                   == 4: TLP only.
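>
> As a usage illustration (not part of this patch), the mode can be set
> from userspace by writing the procfs knob, equivalent to
> "sysctl -w net.ipv4.tcp_early_retrans=3":
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>           FILE *f = fopen("/proc/sys/net/ipv4/tcp_early_retrans", "w");
>
>           if (!f) {
>                   perror("fopen");
>                   return 1;
>           }
>           fprintf(f, "3\n"); /* 3 = TLP and delayed ER (the default) */
>           fclose(f);
>           return 0;
>   }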
>
> The TLP patch series has been extensively tested on Google Web servers.
> It is most effective for short Web transactions, where it reduced RTOs by 15%
> and improved HTTP response time (average by 6%, 99th percentile by 10%).
> The transmitted probes account for <0.5% of the overall transmissions.
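>
> Probe volume can be observed via the TCPLossProbes counter this patch
> exports in the TcpExt section of /proc/net/netstat. A rough sketch of
> reading it, assuming that file's alternating header/value line layout:
>
>   #include <stdio.h>
>   #include <string.h>
>
>   int main(void)
>   {
>           char names[4096], vals[4096];
>           char *sn, *sv, *n, *v;
>           FILE *f = fopen("/proc/net/netstat", "r");
>
>           if (!f) {
>                   perror("fopen");
>                   return 1;
>           }
>           /* each section is a header line followed by a value line */
>           while (fgets(names, sizeof(names), f) &&
>                  fgets(vals, sizeof(vals), f)) {
>                   if (strncmp(names, "TcpExt:", 7))
>                           continue;
>                   /* walk name/value columns in lock-step */
>                   n = strtok_r(names, " \n", &sn);
>                   v = strtok_r(vals, " \n", &sv);
>                   while (n && v) {
>                           if (!strcmp(n, "TCPLossProbes"))
>                                   printf("TCPLossProbes = %s\n", v);
>                           n = strtok_r(NULL, " \n", &sn);
>                           v = strtok_r(NULL, " \n", &sv);
>                   }
>           }
>           fclose(f);
>           return 0;
>   }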
>
> Signed-off-by: Nandita Dukkipati <nanditad@...gle.com>
> ---
Acked-by: Yuchung Cheng <ycheng@...gle.com>

>  Documentation/networking/ip-sysctl.txt |   8 ++-
>  include/linux/tcp.h                    |   1 -
>  include/net/inet_connection_sock.h     |   5 +-
>  include/net/tcp.h                      |   6 +-
>  include/uapi/linux/snmp.h              |   1 +
>  net/ipv4/inet_diag.c                   |   4 +-
>  net/ipv4/proc.c                        |   1 +
>  net/ipv4/sysctl_net_ipv4.c             |   4 +-
>  net/ipv4/tcp_input.c                   |  24 ++++---
>  net/ipv4/tcp_ipv4.c                    |   4 +-
>  net/ipv4/tcp_output.c                  | 128 +++++++++++++++++++++++++++++++--
>  net/ipv4/tcp_timer.c                   |  13 ++--
>  12 files changed, 171 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index dc2dc87..1cae6c3 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -190,7 +190,9 @@ tcp_early_retrans - INTEGER
>         Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
>         for triggering fast retransmit when the amount of outstanding data is
>         small and when no previously unsent data can be transmitted (such
> -       that limited transmit could be used).
> +       that limited transmit could be used). Also controls the use of
> +       Tail loss probe (TLP) that converts RTOs occurring due to tail
> +       losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
>         Possible values:
>                 0 disables ER
>                 1 enables ER
> @@ -198,7 +200,9 @@ tcp_early_retrans - INTEGER
>                   by a fourth of RTT. This mitigates connection falsely
>                   recovers when network has a small degree of reordering
>                   (less than 3 packets).
> -       Default: 2
> +               3 enables delayed ER and TLP.
> +               4 enables TLP only.
> +       Default: 3
>
>  tcp_ecn - INTEGER
>         Control use of Explicit Congestion Notification (ECN) by TCP.
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 515c374..01860d7 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -201,7 +201,6 @@ struct tcp_sock {
>                 unused      : 1;
>         u8      repair_queue;
>         u8      do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
> -               early_retrans_delayed:1, /* Delayed ER timer installed */
>                 syn_data:1,     /* SYN includes data */
>                 syn_fastopen:1, /* SYN includes Fast Open option */
>                 syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 1832927..de2c785 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -133,6 +133,8 @@ struct inet_connection_sock {
>  #define ICSK_TIME_RETRANS      1       /* Retransmit timer */
>  #define ICSK_TIME_DACK         2       /* Delayed ack timer */
>  #define ICSK_TIME_PROBE0       3       /* Zero window probe timer */
> +#define ICSK_TIME_EARLY_RETRANS 4      /* Early retransmit timer */
> +#define ICSK_TIME_LOSS_PROBE   5       /* Tail loss probe timer */
>
>  static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
>  {
> @@ -222,7 +224,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
>                 when = max_when;
>         }
>
> -       if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
> +       if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
> +           what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE) {
>                 icsk->icsk_pending = what;
>                 icsk->icsk_timeout = jiffies + when;
>                 sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a2baa5e..ab9f947 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -543,6 +543,8 @@ extern bool tcp_syn_flood_action(struct sock *sk,
>  extern void tcp_push_one(struct sock *, unsigned int mss_now);
>  extern void tcp_send_ack(struct sock *sk);
>  extern void tcp_send_delayed_ack(struct sock *sk);
> +extern void tcp_send_loss_probe(struct sock *sk);
> +extern bool tcp_schedule_loss_probe(struct sock *sk);
>
>  /* tcp_input.c */
>  extern void tcp_cwnd_application_limited(struct sock *sk);
> @@ -873,8 +875,8 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
>  static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
>  {
>         tp->do_early_retrans = sysctl_tcp_early_retrans &&
> -               !sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
> -       tp->early_retrans_delayed = 0;
> +               sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack &&
> +               sysctl_tcp_reordering == 3;
>  }
>
>  static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index b49eab8..290bed6 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -202,6 +202,7 @@ enum
>         LINUX_MIB_TCPFORWARDRETRANS,            /* TCPForwardRetrans */
>         LINUX_MIB_TCPSLOWSTARTRETRANS,          /* TCPSlowStartRetrans */
>         LINUX_MIB_TCPTIMEOUTS,                  /* TCPTimeouts */
> +       LINUX_MIB_TCPLOSSPROBES,                /* TCPLossProbes */
>         LINUX_MIB_TCPRENORECOVERYFAIL,          /* TCPRenoRecoveryFail */
>         LINUX_MIB_TCPSACKRECOVERYFAIL,          /* TCPSackRecoveryFail */
>         LINUX_MIB_TCPSCHEDULERFAILED,           /* TCPSchedulerFailed */
> diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
> index 7afa2c3..8620408 100644
> --- a/net/ipv4/inet_diag.c
> +++ b/net/ipv4/inet_diag.c
> @@ -158,7 +158,9 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
>
>  #define EXPIRES_IN_MS(tmo)  DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)
>
> -       if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
> +       if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
>                 r->idiag_timer = 1;
>                 r->idiag_retrans = icsk->icsk_retransmits;
>                 r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout);
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 32030a2..4c35911 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -224,6 +224,7 @@ static const struct snmp_mib snmp4_net_list[] = {
>         SNMP_MIB_ITEM("TCPForwardRetrans", LINUX_MIB_TCPFORWARDRETRANS),
>         SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS),
>         SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS),
> +       SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES),
>         SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
>         SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED),
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 960fd29..cca4550 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -28,7 +28,7 @@
>
>  static int zero;
>  static int one = 1;
> -static int two = 2;
> +static int four = 4;
>  static int tcp_retr1_max = 255;
>  static int ip_local_port_range_min[] = { 1, 1 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
> @@ -760,7 +760,7 @@ static struct ctl_table ipv4_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = &zero,
> -               .extra2         = &two,
> +               .extra2         = &four,
>         },
>         {
>                 .procname       = "udp_mem",
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 0d9bdac..b794f89 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -98,7 +98,7 @@ int sysctl_tcp_frto_response __read_mostly;
>  int sysctl_tcp_thin_dupack __read_mostly;
>
>  int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
> -int sysctl_tcp_early_retrans __read_mostly = 2;
> +int sysctl_tcp_early_retrans __read_mostly = 3;
>
>  #define FLAG_DATA              0x01 /* Incoming frame contained data.          */
>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a window update.       */
> @@ -2150,15 +2150,16 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
>          * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
>          * available, or RTO is scheduled to fire first.
>          */
> -       if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
> +       if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 ||
> +           (flag & FLAG_ECE) || !tp->srtt)
>                 return false;
>
>         delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
>         if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
>                 return false;
>
> -       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
> -       tp->early_retrans_delayed = 1;
> +       inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay,
> +                                 TCP_RTO_MAX);
>         return true;
>  }
>
> @@ -2321,7 +2322,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
>          * interval if appropriate.
>          */
>         if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
> -           (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
> +           (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
>             !tcp_may_send_now(sk))
>                 return !tcp_pause_early_retransmit(sk, flag);
>
> @@ -3081,6 +3082,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
>   */
>  void tcp_rearm_rto(struct sock *sk)
>  {
> +       const struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>
>         /* If the retrans timer is currently being used by Fast Open
> @@ -3094,12 +3096,13 @@ void tcp_rearm_rto(struct sock *sk)
>         } else {
>                 u32 rto = inet_csk(sk)->icsk_rto;
>                 /* Offset the time elapsed after installing regular RTO */
> -               if (tp->early_retrans_delayed) {
> +               if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +                   icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
>                         struct sk_buff *skb = tcp_write_queue_head(sk);
>                         const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
>                         s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
>                         /* delta may not be positive if the socket is locked
> -                        * when the delayed ER timer fires and is rescheduled.
> +                        * when the retrans timer fires and is rescheduled.
>                          */
>                         if (delta > 0)
>                                 rto = delta;
> @@ -3107,7 +3110,6 @@ void tcp_rearm_rto(struct sock *sk)
>                 inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
>                                           TCP_RTO_MAX);
>         }
> -       tp->early_retrans_delayed = 0;
>  }
>
>  /* This function is called when the delayed ER timer fires. TCP enters
> @@ -3601,7 +3603,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>         if (after(ack, tp->snd_nxt))
>                 goto invalid_ack;
>
> -       if (tp->early_retrans_delayed)
> +       if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>                 tcp_rearm_rto(sk);
>
>         if (after(ack, prior_snd_una))
> @@ -3678,6 +3681,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>                 if (dst)
>                         dst_confirm(dst);
>         }
> +
> +       if (icsk->icsk_pending == ICSK_TIME_RETRANS)
> +               tcp_schedule_loss_probe(sk);
>         return 1;
>
>  no_queue:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 8cdee12..b7ab868 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2703,7 +2703,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
>         __u16 srcp = ntohs(inet->inet_sport);
>         int rx_queue;
>
> -       if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
> +       if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
>                 timer_active    = 1;
>                 timer_expires   = icsk->icsk_timeout;
>         } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e2b4461..beb63db 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -74,6 +74,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  /* Account for new data that has been sent to the network. */
>  static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
>  {
> +       struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         unsigned int prior_packets = tp->packets_out;
>
> @@ -85,7 +86,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
>                 tp->frto_counter = 3;
>
>         tp->packets_out += tcp_skb_pcount(skb);
> -       if (!prior_packets || tp->early_retrans_delayed)
> +       if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> +           icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
>                 tcp_rearm_rto(sk);
>  }
>
> @@ -1959,6 +1961,9 @@ static int tcp_mtu_probe(struct sock *sk)
>   * snd_up-64k-mss .. snd_up cannot be large. However, taking into
>   * account rare use of URG, this is not a big flaw.
>   *
> + * Send at most one packet when push_one > 0. Temporarily ignore
> + * cwnd limit to force at most one packet out when push_one == 2.
> +
>   * Returns true, if no segments are in flight and we have queued segments,
>   * but cannot send anything now because of SWS or another problem.
>   */
> @@ -1994,8 +1999,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                         goto repair; /* Skip network transmission */
>
>                 cwnd_quota = tcp_cwnd_test(tp, skb);
> -               if (!cwnd_quota)
> -                       break;
> +               if (!cwnd_quota) {
> +                       if (push_one == 2)
> +                               /* Force out a loss probe pkt. */
> +                               cwnd_quota = 1;
> +                       else
> +                               break;
> +               }
>
>                 if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
>                         break;
> @@ -2049,10 +2059,120 @@ repair:
>         if (likely(sent_pkts)) {
>                 if (tcp_in_cwnd_reduction(sk))
>                         tp->prr_out += sent_pkts;
> +
> +               /* Send one loss probe per tail loss episode. */
> +               if (push_one != 2)
> +                       tcp_schedule_loss_probe(sk);
>                 tcp_cwnd_validate(sk);
>                 return false;
>         }
> -       return !tp->packets_out && tcp_send_head(sk);
> +       return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
> +}
> +
> +bool tcp_schedule_loss_probe(struct sock *sk)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +       struct tcp_sock *tp = tcp_sk(sk);
> +       u32 timeout, tlp_time_stamp, rto_time_stamp;
> +       u32 rtt = tp->srtt >> 3;
> +
> +       if (WARN_ON(icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS))
> +               return false;
> +       /* No consecutive loss probes. */
> +       if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
> +               tcp_rearm_rto(sk);
> +               return false;
> +       }
> +       /* Don't do any loss probe on a Fast Open connection before 3WHS
> +        * finishes.
> +        */
> +       if (sk->sk_state == TCP_SYN_RECV)
> +               return false;
> +
> +       /* TLP is only scheduled when next timer event is RTO. */
> +       if (icsk->icsk_pending != ICSK_TIME_RETRANS)
> +               return false;
> +
> +       /* Schedule a loss probe in 2*RTT for SACK capable connections
> +        * in Open state, that are either limited by cwnd or application.
> +        */
> +       if (sysctl_tcp_early_retrans < 3 || !rtt || !tp->packets_out ||
> +           !tcp_is_sack(tp) || inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
> +               return false;
> +
> +       if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
> +            tcp_send_head(sk))
> +               return false;
> +
> +       /* Probe timeout is at least 1.5*rtt + TCP_DELACK_MAX to account
> +        * for delayed ack when there's one outstanding packet.
> +        */
> +       timeout = rtt << 1;
> +       if (tp->packets_out == 1)
> +               timeout = max_t(u32, timeout,
> +                               (rtt + (rtt >> 1) + TCP_DELACK_MAX));
> +       timeout = max_t(u32, timeout, msecs_to_jiffies(10));
> +
> +       /* If RTO is shorter, just schedule TLP in its place. */
> +       tlp_time_stamp = tcp_time_stamp + timeout;
> +       rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
> +       if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
> +               s32 delta = rto_time_stamp - tcp_time_stamp;
> +               if (delta > 0)
> +                       timeout = delta;
> +       }
> +
> +       inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
> +                                 TCP_RTO_MAX);
> +       return true;
> +}
> +
> +/* When probe timeout (PTO) fires, send a new segment if one exists, else
> + * retransmit the last segment.
> + */
> +void tcp_send_loss_probe(struct sock *sk)
> +{
> +       struct sk_buff *skb;
> +       int pcount;
> +       int mss = tcp_current_mss(sk);
> +       int err = -1;
> +
> +       if (tcp_send_head(sk) != NULL) {
> +               err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
> +               goto rearm_timer;
> +       }
> +
> +       /* Retransmit last segment. */
> +       skb = tcp_write_queue_tail(sk);
> +       if (WARN_ON(!skb))
> +               goto rearm_timer;
> +
> +       pcount = tcp_skb_pcount(skb);
> +       if (WARN_ON(!pcount))
> +               goto rearm_timer;
> +
> +       if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
> +               if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss)))
> +                       goto rearm_timer;
> +               skb = tcp_write_queue_tail(sk);
> +       }
> +
> +       if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
> +               goto rearm_timer;
> +
> +       /* Probe with zero data doesn't trigger fast recovery. */
> +       if (skb->len > 0)
> +               err = __tcp_retransmit_skb(sk, skb);
> +
> +rearm_timer:
> +       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> +                                 inet_csk(sk)->icsk_rto,
> +                                 TCP_RTO_MAX);
> +
> +       if (likely(!err))
> +               NET_INC_STATS_BH(sock_net(sk),
> +                                LINUX_MIB_TCPLOSSPROBES);
> +       return;
>  }
>
>  /* Push out any pending frames which were held back due to
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b78aac3..ecd61d5 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -342,10 +342,6 @@ void tcp_retransmit_timer(struct sock *sk)
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
>
> -       if (tp->early_retrans_delayed) {
> -               tcp_resume_early_retransmit(sk);
> -               return;
> -       }
>         if (tp->fastopen_rsk) {
>                 WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>                              sk->sk_state != TCP_FIN_WAIT1);
> @@ -495,13 +491,20 @@ void tcp_write_timer_handler(struct sock *sk)
>         }
>
>         event = icsk->icsk_pending;
> -       icsk->icsk_pending = 0;
>
>         switch (event) {
> +       case ICSK_TIME_EARLY_RETRANS:
> +               tcp_resume_early_retransmit(sk);
> +               break;
> +       case ICSK_TIME_LOSS_PROBE:
> +               tcp_send_loss_probe(sk);
> +               break;
>         case ICSK_TIME_RETRANS:
> +               icsk->icsk_pending = 0;
>                 tcp_retransmit_timer(sk);
>                 break;
>         case ICSK_TIME_PROBE0:
> +               icsk->icsk_pending = 0;
>                 tcp_probe_timer(sk);
>                 break;
>         }
> --
> 1.8.1.3
>