Message-ID: <CAK6E8=eVEXe5gn1vFVD9=OtF0nbq0DtoABHreh-5xdut1h5BhQ@mail.gmail.com>
Date: Mon, 11 Mar 2013 18:47:00 -0400
From: Yuchung Cheng <ycheng@...gle.com>
To: Nandita Dukkipati <nanditad@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Neal Cardwell <ncardwell@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
netdev <netdev@...r.kernel.org>,
Ilpo Jarvinen <ilpo.jarvinen@...helsinki.fi>,
Tom Herbert <therbert@...gle.com>
Subject: Re: [PATCH 1/2] tcp: Tail loss probe (TLP)
On Mon, Mar 11, 2013 at 4:00 PM, Nandita Dukkipati <nanditad@...gle.com> wrote:
> This patch series implements the Tail loss probe (TLP) algorithm described
> in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
> first patch implements the basic algorithm.
>
> TLP's goal is to reduce tail latency of short transactions. It achieves
> this by converting retransmission timeouts (RTOs) occurring due
> to tail losses (losses at the end of transactions) into fast recovery.
> TLP transmits one packet in two round-trips when a connection is in
> Open state and isn't receiving any ACKs. The transmitted packet, aka
> loss probe, can be either new or a retransmission. When there is tail
> loss, the ACK from a loss probe triggers FACK/early-retransmit-based
> fast recovery, thus avoiding a costly RTO. In the absence of loss,
> there is no change in the connection state.
>
> PTO stands for probe timeout. It is a timer event indicating
> that an ACK is overdue and triggers a loss probe packet. The PTO value
> is set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
> ACK timer when there is only one outstanding packet.
>
> TLP Algorithm
>
> On transmission of new data in Open state:
> -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
> -> packets_out == 1: schedule PTO in max(2*SRTT, 1.5*SRTT + 200ms)
> -> PTO = min(PTO, RTO)
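>
> For illustration only, here is a minimal, self-contained sketch of the
> PTO value computation above (plain C in milliseconds; the function and
> parameter names, and the millisecond units, are for exposition only;
> the kernel computes this in jiffies in tcp_schedule_loss_probe(), see
> the patch below):
>
>	/* Sketch only: PTO per the rules above, in milliseconds. */
>	static unsigned int tlp_pto_ms(unsigned int srtt_ms,
>				       unsigned int rto_ms,
>				       unsigned int packets_out)
>	{
>		unsigned int pto = 2 * srtt_ms;	/* base: 2*SRTT */
>
>		/* One outstanding packet: allow for a delayed ACK,
>		 * i.e. 1.5*SRTT + TCP_DELACK_MAX (200ms).
>		 */
>		if (packets_out == 1 &&
>		    pto < srtt_ms + srtt_ms / 2 + 200)
>			pto = srtt_ms + srtt_ms / 2 + 200;
>
>		if (pto < 10)		/* 10ms floor */
>			pto = 10;
>		if (pto > rto_ms)	/* PTO = min(PTO, RTO) */
>			pto = rto_ms;
>		return pto;
>	}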
>
> Conditions for scheduling PTO:
> -> Connection is in Open state.
> -> Connection is either cwnd limited or has no new data to send.
> -> Number of probes per tail loss episode is limited to one.
> -> Connection is SACK enabled.
>
> When PTO fires:
> new_segment_exists:
> -> transmit new segment.
> -> packets_out++. cwnd remains same.
>
> no_new_segment:
> -> retransmit the last segment.
> Its ACK triggers FACK or early-retransmit-based recovery.
>
> ACK path:
> -> rearm RTO at start of ACK processing.
> -> reschedule PTO if need be.
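>
> For exposition, the scheduling conditions and the probe decision can be
> summarized with the following self-contained sketch (plain C; the names
> are illustrative, not the kernel's; the real logic lives in
> tcp_schedule_loss_probe() and tcp_send_loss_probe() in the patch below):
>
>	#include <stdbool.h>
>
>	enum tlp_action { TLP_NONE, TLP_SEND_NEW, TLP_RETRANS_LAST };
>
>	/* Conditions for scheduling a PTO, per the list above. */
>	static bool tlp_may_schedule(bool open_state, bool sack_enabled,
>				     bool cwnd_limited_or_no_new_data,
>				     unsigned int probes_this_episode)
>	{
>		return open_state && sack_enabled &&
>		       cwnd_limited_or_no_new_data &&
>		       probes_this_episode < 1;
>	}
>
>	/* When the PTO fires: prefer new data (packets_out++, cwnd
>	 * unchanged); otherwise re-send the tail segment so its ACK
>	 * can trigger FACK/early-retransmit fast recovery.
>	 */
>	static enum tlp_action tlp_on_pto(bool new_segment_exists)
>	{
>		return new_segment_exists ? TLP_SEND_NEW : TLP_RETRANS_LAST;
>	}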
>
> In addition, the patch includes a small variation to the Early Retransmit
> (ER) algorithm, such that ER and TLP together can in principle recover any
> N-degree of tail loss through fast recovery. TLP is controlled by the same
> sysctl as ER, tcp_early_retrans:
> tcp_early_retrans==0; disables TLP and ER.
> ==1; enables RFC5827 ER.
> ==2; delayed ER.
> ==3; TLP and delayed ER. [DEFAULT]
> ==4; TLP only.
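>
> For example, the default behavior corresponds to
> "sysctl -w net.ipv4.tcp_early_retrans=3" (equivalently, writing 3 to
> /proc/sys/net/ipv4/tcp_early_retrans).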
>
> The TLP patch series has been extensively tested on Google Web servers.
> It is most effective for short Web transactions, where it reduced RTOs by 15%
> and improved HTTP response time (average by 6%, 99th percentile by 10%).
> The transmitted probes account for <0.5% of the overall transmissions.
>
> Signed-off-by: Nandita Dukkipati <nanditad@...gle.com>
> ---
Acked-by: Yuchung Cheng <ycheng@...gle.com>
> Documentation/networking/ip-sysctl.txt | 8 ++-
> include/linux/tcp.h | 1 -
> include/net/inet_connection_sock.h | 5 +-
> include/net/tcp.h | 6 +-
> include/uapi/linux/snmp.h | 1 +
> net/ipv4/inet_diag.c | 4 +-
> net/ipv4/proc.c | 1 +
> net/ipv4/sysctl_net_ipv4.c | 4 +-
> net/ipv4/tcp_input.c | 24 ++++---
> net/ipv4/tcp_ipv4.c | 4 +-
> net/ipv4/tcp_output.c | 128 +++++++++++++++++++++++++++++++--
> net/ipv4/tcp_timer.c | 13 ++--
> 12 files changed, 171 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index dc2dc87..1cae6c3 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -190,7 +190,9 @@ tcp_early_retrans - INTEGER
> Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
> for triggering fast retransmit when the amount of outstanding data is
> small and when no previously unsent data can be transmitted (such
> - that limited transmit could be used).
> + that limited transmit could be used). Also controls the use of
> + Tail loss probe (TLP) that converts RTOs occurring due to tail
> + losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
> Possible values:
> 0 disables ER
> 1 enables ER
> @@ -198,7 +200,9 @@ tcp_early_retrans - INTEGER
> by a fourth of RTT. This mitigates connection falsely
> recovers when network has a small degree of reordering
> (less than 3 packets).
> - Default: 2
> + 3 enables delayed ER and TLP.
> + 4 enables TLP only.
> + Default: 3
>
> tcp_ecn - INTEGER
> Control use of Explicit Congestion Notification (ECN) by TCP.
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 515c374..01860d7 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -201,7 +201,6 @@ struct tcp_sock {
> unused : 1;
> u8 repair_queue;
> u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */
> - early_retrans_delayed:1, /* Delayed ER timer installed */
> syn_data:1, /* SYN includes data */
> syn_fastopen:1, /* SYN includes Fast Open option */
> syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 1832927..de2c785 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -133,6 +133,8 @@ struct inet_connection_sock {
> #define ICSK_TIME_RETRANS 1 /* Retransmit timer */
> #define ICSK_TIME_DACK 2 /* Delayed ack timer */
> #define ICSK_TIME_PROBE0 3 /* Zero window probe timer */
> +#define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */
> +#define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */
>
> static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
> {
> @@ -222,7 +224,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
> when = max_when;
> }
>
> - if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
> + if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
> + what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE) {
> icsk->icsk_pending = what;
> icsk->icsk_timeout = jiffies + when;
> sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a2baa5e..ab9f947 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -543,6 +543,8 @@ extern bool tcp_syn_flood_action(struct sock *sk,
> extern void tcp_push_one(struct sock *, unsigned int mss_now);
> extern void tcp_send_ack(struct sock *sk);
> extern void tcp_send_delayed_ack(struct sock *sk);
> +extern void tcp_send_loss_probe(struct sock *sk);
> +extern bool tcp_schedule_loss_probe(struct sock *sk);
>
> /* tcp_input.c */
> extern void tcp_cwnd_application_limited(struct sock *sk);
> @@ -873,8 +875,8 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
> static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
> {
> tp->do_early_retrans = sysctl_tcp_early_retrans &&
> - !sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
> - tp->early_retrans_delayed = 0;
> + sysctl_tcp_early_retrans < 4 && !sysctl_tcp_thin_dupack &&
> + sysctl_tcp_reordering == 3;
> }
>
> static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index b49eab8..290bed6 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -202,6 +202,7 @@ enum
> LINUX_MIB_TCPFORWARDRETRANS, /* TCPForwardRetrans */
> LINUX_MIB_TCPSLOWSTARTRETRANS, /* TCPSlowStartRetrans */
> LINUX_MIB_TCPTIMEOUTS, /* TCPTimeouts */
> + LINUX_MIB_TCPLOSSPROBES, /* TCPLossProbes */
> LINUX_MIB_TCPRENORECOVERYFAIL, /* TCPRenoRecoveryFail */
> LINUX_MIB_TCPSACKRECOVERYFAIL, /* TCPSackRecoveryFail */
> LINUX_MIB_TCPSCHEDULERFAILED, /* TCPSchedulerFailed */
> diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
> index 7afa2c3..8620408 100644
> --- a/net/ipv4/inet_diag.c
> +++ b/net/ipv4/inet_diag.c
> @@ -158,7 +158,9 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
>
> #define EXPIRES_IN_MS(tmo) DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)
>
> - if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
> + if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
> r->idiag_timer = 1;
> r->idiag_retrans = icsk->icsk_retransmits;
> r->idiag_expires = EXPIRES_IN_MS(icsk->icsk_timeout);
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 32030a2..4c35911 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -224,6 +224,7 @@ static const struct snmp_mib snmp4_net_list[] = {
> SNMP_MIB_ITEM("TCPForwardRetrans", LINUX_MIB_TCPFORWARDRETRANS),
> SNMP_MIB_ITEM("TCPSlowStartRetrans", LINUX_MIB_TCPSLOWSTARTRETRANS),
> SNMP_MIB_ITEM("TCPTimeouts", LINUX_MIB_TCPTIMEOUTS),
> + SNMP_MIB_ITEM("TCPLossProbes", LINUX_MIB_TCPLOSSPROBES),
> SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
> SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
> SNMP_MIB_ITEM("TCPSchedulerFailed", LINUX_MIB_TCPSCHEDULERFAILED),
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 960fd29..cca4550 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -28,7 +28,7 @@
>
> static int zero;
> static int one = 1;
> -static int two = 2;
> +static int four = 4;
> static int tcp_retr1_max = 255;
> static int ip_local_port_range_min[] = { 1, 1 };
> static int ip_local_port_range_max[] = { 65535, 65535 };
> @@ -760,7 +760,7 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = &zero,
> - .extra2 = &two,
> + .extra2 = &four,
> },
> {
> .procname = "udp_mem",
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 0d9bdac..b794f89 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -98,7 +98,7 @@ int sysctl_tcp_frto_response __read_mostly;
> int sysctl_tcp_thin_dupack __read_mostly;
>
> int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
> -int sysctl_tcp_early_retrans __read_mostly = 2;
> +int sysctl_tcp_early_retrans __read_mostly = 3;
>
> #define FLAG_DATA 0x01 /* Incoming frame contained data. */
> #define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
> @@ -2150,15 +2150,16 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
> * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
> * available, or RTO is scheduled to fire first.
> */
> - if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
> + if (sysctl_tcp_early_retrans < 2 || sysctl_tcp_early_retrans > 3 ||
> + (flag & FLAG_ECE) || !tp->srtt)
> return false;
>
> delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
> if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
> return false;
>
> - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
> - tp->early_retrans_delayed = 1;
> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_EARLY_RETRANS, delay,
> + TCP_RTO_MAX);
> return true;
> }
>
> @@ -2321,7 +2322,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
> * interval if appropriate.
> */
> if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
> - (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
> + (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
> !tcp_may_send_now(sk))
> return !tcp_pause_early_retransmit(sk, flag);
>
> @@ -3081,6 +3082,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
> */
> void tcp_rearm_rto(struct sock *sk)
> {
> + const struct inet_connection_sock *icsk = inet_csk(sk);
> struct tcp_sock *tp = tcp_sk(sk);
>
> /* If the retrans timer is currently being used by Fast Open
> @@ -3094,12 +3096,13 @@ void tcp_rearm_rto(struct sock *sk)
> } else {
> u32 rto = inet_csk(sk)->icsk_rto;
> /* Offset the time elapsed after installing regular RTO */
> - if (tp->early_retrans_delayed) {
> + if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
> struct sk_buff *skb = tcp_write_queue_head(sk);
> const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
> s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
> /* delta may not be positive if the socket is locked
> - * when the delayed ER timer fires and is rescheduled.
> + * when the retrans timer fires and is rescheduled.
> */
> if (delta > 0)
> rto = delta;
> @@ -3107,7 +3110,6 @@ void tcp_rearm_rto(struct sock *sk)
> inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
> TCP_RTO_MAX);
> }
> - tp->early_retrans_delayed = 0;
> }
>
> /* This function is called when the delayed ER timer fires. TCP enters
> @@ -3601,7 +3603,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> if (after(ack, tp->snd_nxt))
> goto invalid_ack;
>
> - if (tp->early_retrans_delayed)
> + if (icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
> tcp_rearm_rto(sk);
>
> if (after(ack, prior_snd_una))
> @@ -3678,6 +3681,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> if (dst)
> dst_confirm(dst);
> }
> +
> + if (icsk->icsk_pending == ICSK_TIME_RETRANS)
> + tcp_schedule_loss_probe(sk);
> return 1;
>
> no_queue:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 8cdee12..b7ab868 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2703,7 +2703,9 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
> __u16 srcp = ntohs(inet->inet_sport);
> int rx_queue;
>
> - if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
> + if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
> timer_active = 1;
> timer_expires = icsk->icsk_timeout;
> } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e2b4461..beb63db 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -74,6 +74,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> /* Account for new data that has been sent to the network. */
> static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> {
> + struct inet_connection_sock *icsk = inet_csk(sk);
> struct tcp_sock *tp = tcp_sk(sk);
> unsigned int prior_packets = tp->packets_out;
>
> @@ -85,7 +86,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> tp->frto_counter = 3;
>
> tp->packets_out += tcp_skb_pcount(skb);
> - if (!prior_packets || tp->early_retrans_delayed)
> + if (!prior_packets || icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS ||
> + icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
> tcp_rearm_rto(sk);
> }
>
> @@ -1959,6 +1961,9 @@ static int tcp_mtu_probe(struct sock *sk)
> * snd_up-64k-mss .. snd_up cannot be large. However, taking into
> * account rare use of URG, this is not a big flaw.
> *
> + * Send at most one packet when push_one > 0. Temporarily ignore
> + * cwnd limit to force at most one packet out when push_one == 2.
> + *
> * Returns true, if no segments are in flight and we have queued segments,
> * but cannot send anything now because of SWS or another problem.
> */
> @@ -1994,8 +1999,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> goto repair; /* Skip network transmission */
>
> cwnd_quota = tcp_cwnd_test(tp, skb);
> - if (!cwnd_quota)
> - break;
> + if (!cwnd_quota) {
> + if (push_one == 2)
> + /* Force out a loss probe pkt. */
> + cwnd_quota = 1;
> + else
> + break;
> + }
>
> if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
> break;
> @@ -2049,10 +2059,120 @@ repair:
> if (likely(sent_pkts)) {
> if (tcp_in_cwnd_reduction(sk))
> tp->prr_out += sent_pkts;
> +
> + /* Send one loss probe per tail loss episode. */
> + if (push_one != 2)
> + tcp_schedule_loss_probe(sk);
> tcp_cwnd_validate(sk);
> return false;
> }
> - return !tp->packets_out && tcp_send_head(sk);
> + return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
> +}
> +
> +bool tcp_schedule_loss_probe(struct sock *sk)
> +{
> + struct inet_connection_sock *icsk = inet_csk(sk);
> + struct tcp_sock *tp = tcp_sk(sk);
> + u32 timeout, tlp_time_stamp, rto_time_stamp;
> + u32 rtt = tp->srtt >> 3;
> +
> + if (WARN_ON(icsk->icsk_pending == ICSK_TIME_EARLY_RETRANS))
> + return false;
> + /* No consecutive loss probes. */
> + if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
> + tcp_rearm_rto(sk);
> + return false;
> + }
> + /* Don't do any loss probe on a Fast Open connection before 3WHS
> + * finishes.
> + */
> + if (sk->sk_state == TCP_SYN_RECV)
> + return false;
> +
> + /* TLP is only scheduled when next timer event is RTO. */
> + if (icsk->icsk_pending != ICSK_TIME_RETRANS)
> + return false;
> +
> + /* Schedule a loss probe in 2*RTT for SACK capable connections
> + * in Open state, that are either limited by cwnd or application.
> + */
> + if (sysctl_tcp_early_retrans < 3 || !rtt || !tp->packets_out ||
> + !tcp_is_sack(tp) || inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
> + return false;
> +
> + if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
> + tcp_send_head(sk))
> + return false;
> +
> + /* Probe timeout is at least 1.5*rtt + TCP_DELACK_MAX to account
> + * for delayed ack when there's one outstanding packet.
> + */
> + timeout = rtt << 1;
> + if (tp->packets_out == 1)
> + timeout = max_t(u32, timeout,
> + (rtt + (rtt >> 1) + TCP_DELACK_MAX));
> + timeout = max_t(u32, timeout, msecs_to_jiffies(10));
> +
> + /* If RTO is shorter, just schedule TLP in its place. */
> + tlp_time_stamp = tcp_time_stamp + timeout;
> + rto_time_stamp = (u32)inet_csk(sk)->icsk_timeout;
> + if ((s32)(tlp_time_stamp - rto_time_stamp) > 0) {
> + s32 delta = rto_time_stamp - tcp_time_stamp;
> + if (delta > 0)
> + timeout = delta;
> + }
> +
> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, timeout,
> + TCP_RTO_MAX);
> + return true;
> +}
> +
> +/* When probe timeout (PTO) fires, send a new segment if one exists, else
> + * retransmit the last segment.
> + */
> +void tcp_send_loss_probe(struct sock *sk)
> +{
> + struct sk_buff *skb;
> + int pcount;
> + int mss = tcp_current_mss(sk);
> + int err = -1;
> +
> + if (tcp_send_head(sk) != NULL) {
> + err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
> + goto rearm_timer;
> + }
> +
> + /* Retransmit last segment. */
> + skb = tcp_write_queue_tail(sk);
> + if (WARN_ON(!skb))
> + goto rearm_timer;
> +
> + pcount = tcp_skb_pcount(skb);
> + if (WARN_ON(!pcount))
> + goto rearm_timer;
> +
> + if ((pcount > 1) && (skb->len > (pcount - 1) * mss)) {
> + if (unlikely(tcp_fragment(sk, skb, (pcount - 1) * mss, mss)))
> + goto rearm_timer;
> + skb = tcp_write_queue_tail(sk);
> + }
> +
> + if (WARN_ON(!skb || !tcp_skb_pcount(skb)))
> + goto rearm_timer;
> +
> + /* Probe with zero data doesn't trigger fast recovery. */
> + if (skb->len > 0)
> + err = __tcp_retransmit_skb(sk, skb);
> +
> +rearm_timer:
> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> + inet_csk(sk)->icsk_rto,
> + TCP_RTO_MAX);
> +
> + if (likely(!err))
> + NET_INC_STATS_BH(sock_net(sk),
> + LINUX_MIB_TCPLOSSPROBES);
> + return;
> }
>
> /* Push out any pending frames which were held back due to
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index b78aac3..ecd61d5 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -342,10 +342,6 @@ void tcp_retransmit_timer(struct sock *sk)
> struct tcp_sock *tp = tcp_sk(sk);
> struct inet_connection_sock *icsk = inet_csk(sk);
>
> - if (tp->early_retrans_delayed) {
> - tcp_resume_early_retransmit(sk);
> - return;
> - }
> if (tp->fastopen_rsk) {
> WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
> sk->sk_state != TCP_FIN_WAIT1);
> @@ -495,13 +491,20 @@ void tcp_write_timer_handler(struct sock *sk)
> }
>
> event = icsk->icsk_pending;
> - icsk->icsk_pending = 0;
>
> switch (event) {
> + case ICSK_TIME_EARLY_RETRANS:
> + tcp_resume_early_retransmit(sk);
> + break;
> + case ICSK_TIME_LOSS_PROBE:
> + tcp_send_loss_probe(sk);
> + break;
> case ICSK_TIME_RETRANS:
> + icsk->icsk_pending = 0;
> tcp_retransmit_timer(sk);
> break;
> case ICSK_TIME_PROBE0:
> + icsk->icsk_pending = 0;
> tcp_probe_timer(sk);
> break;
> }
> --
> 1.8.1.3
>