netdev - RE: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data ACKed/SACKed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <F95AC9340317A84688A5F0DF0246F3F2029373BA@DGGEMI512-MBX.china.huawei.com>
Date:   Fri, 4 Aug 2017 07:33:26 +0000
From:   maowenan <maowenan@...wei.com>
To:     maowenan <maowenan@...wei.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        David Miller <davem@...emloft.net>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Yuchung Cheng <ycheng@...gle.com>,
        Nandita Dukkipati <nanditad@...gle.com>
Subject: RE: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data
 ACKed/SACKed



> -----Original Message-----
> From: maowenan
> Sent: Tuesday, August 01, 2017 8:20 PM
> To: 'Neal Cardwell'; David Miller
> Cc: netdev@...r.kernel.org; Yuchung Cheng; Nandita Dukkipati
> Subject: RE: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data
> ACKed/SACKed
> 
> 
> 
> > -----Original Message-----
> > From: netdev-owner@...r.kernel.org
> > [mailto:netdev-owner@...r.kernel.org]
> > On Behalf Of Neal Cardwell
> > Sent: Tuesday, August 01, 2017 10:58 AM
> > To: David Miller
> > Cc: netdev@...r.kernel.org; Neal Cardwell; Yuchung Cheng; Nandita
> > Dukkipati
> > Subject: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data
> > ACKed/SACKed
> >
> > Fix a TCP loss recovery performance bug raised recently on the netdev
> > list, in two threads:
> >
> > (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
> > (ii) July 26, 2017: netdev thread:
> >      "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
> >      outstanding TLP retransmission"
> >
> > The basic problem is that incoming TCP packets that did not indicate
> > forward progress could cause the xmit timer (TLP or RTO) to be rearmed
> > and pushed back in time. In certain corner cases this could result in
> > the following problems noted in these threads:
> >
> >  - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
> >    could cause TCP to repeatedly schedule TLPs forever. We kept
> >    sending TLPs after every ~200ms, which elicited bogus SACKs, which
> >    caused more TLPs, ad infinitum; we never fired an RTO to fill in
> >    the holes.
> >
> >  - Incoming data segments could, in some cases, cause us to reschedule
> >    our RTO or TLP timer further out in time, for no good reason. This
> >    could cause repeated inbound data to result in stalls in outbound
> >    data, in the presence of packet loss.
> >
> > This commit fixes these bugs by changing the TLP and RTO ACK processing to:
> >
> >  (a) Only reschedule the xmit timer once per ACK.
> >
> >  (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
> >      ACK indicates sufficient forward progress (a packet was
> >      cumulatively ACKed, or we got a SACK for a packet that was sent
> >      before the most recent retransmit of the write queue head).
> >
> > This brings us back into closer compliance with the RFCs, since, as
> > the comment for tcp_rearm_rto() notes, we should only restart the RTO
> > timer after forward progress on the connection. Previously we were
> > restarting the xmit timer even in these cases where there was no forward
> progress.
> >
> > As a side benefit, this commit simplifies and speeds up the TCP timer
> > arming logic. We had been calling inet_csk_reset_xmit_timer() three
> > times on normal ACKs that cumulatively acknowledged some data:
> >
> > 1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
> >         if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
> >                tcp_rearm_rto(sk);
> >
> > 2) Once in tcp_clean_rtx_queue(), to update the RTO:
> >         if (flag & FLAG_ACKED) {
> >                tcp_rearm_rto(sk);
> >
> > 3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
> >    to TLP:
> >         if (icsk->icsk_pending == ICSK_TIME_RETRANS)
> >                tcp_schedule_loss_probe(sk);
> >
> > This commit, by only rescheduling the xmit timer once per ACK,
> > simplifies the code and reduces CPU overhead.
> >
> > This commit was tested in an A/B test with Google web server traffic.
> > SNMP stats and request latency metrics were within noise levels,
> > substantiating that for normal web traffic patterns this is a rare
> > issue. This commit was also tested with packetdrill tests to verify
> > that it fixes the timer behavior in the corner cases discussed in the netdev
> threads mentioned above.
> >
> > This patch is a bug fix patch intended to be queued for -stable relases.
> >
> > Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)")
> > Reported-by: Klavs Klavsen <kl@...n.dk>
> > Reported-by: Mao Wenan <maowenan@...wei.com>
> > Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
> > Signed-off-by: Yuchung Cheng <ycheng@...gle.com>
> > Signed-off-by: Nandita Dukkipati <nanditad@...gle.com>
> > ---
> >  net/ipv4/tcp_input.c  | 25 ++++++++++++++++---------
> > net/ipv4/tcp_output.c |  9 ---------
> >  2 files changed, 16 insertions(+), 18 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index
> > 345febf0a46e..3e777cfbba56 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -107,6 +107,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly =
> HZ/2;
> >  #define FLAG_ORIG_SACK_ACKED	0x200 /* Never retransmitted data are
> > (s)acked	*/
> >  #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!=
> > FLAG_DATA_ACKED) */
> >  #define FLAG_DSACKING_ACK	0x800 /* SACK blocks contained D-SACK info
> > */
> > +#define FLAG_SET_XMIT_TIMER	0x1000 /* Set TLP or RTO timer */
> >  #define FLAG_SACK_RENEGING	0x2000 /* snd_una advanced to a
> sacked
> > seq */
> >  #define FLAG_UPDATE_TS_RECENT	0x4000 /* tcp_replace_ts_recent() */
> >  #define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call
> > tcp_send_challenge_ack()	*/
> > @@ -3016,6 +3017,13 @@ void tcp_rearm_rto(struct sock *sk)
> >  	}
> >  }
> >
> > +/* Try to schedule a loss probe; if that doesn't work, then schedule
> > +an RTO. */ static void tcp_set_xmit_timer(struct sock *sk) {
> > +	if (!tcp_schedule_loss_probe(sk))
> > +		tcp_rearm_rto(sk);
> > +}
> > +
> >  /* If we get here, the whole TSO packet has not been acked. */
> > static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)  { @@
> > -3177,7 +3185,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int
> prior_fackets,
> >  					ca_rtt_us, sack->rate);
> >
> >  	if (flag & FLAG_ACKED) {
> > -		tcp_rearm_rto(sk);
> > +		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
> >  		if (unlikely(icsk->icsk_mtup.probe_size &&
> >  			     !after(tp->mtu_probe.probe_seq_end, tp->snd_una))) {
> >  			tcp_mtup_probe_success(sk);
> > @@ -3205,7 +3213,7 @@ static int tcp_clean_rtx_queue(struct sock *sk,
> > int prior_fackets,
> >  		 * after when the head was last (re)transmitted. Otherwise the
> >  		 * timeout may continue to extend in loss recovery.
> >  		 */
> > -		tcp_rearm_rto(sk);
> > +		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
> >  	}
> >
> >  	if (icsk->icsk_ca_ops->pkts_acked) { @@ -3577,9 +3585,6 @@ static
> > int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> >  	if (after(ack, tp->snd_nxt))
> >  		goto invalid_ack;
> >
> > -	if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
> > -		tcp_rearm_rto(sk);
> > -
> >  	if (after(ack, prior_snd_una)) {
> >  		flag |= FLAG_SND_UNA_ADVANCED;
> >  		icsk->icsk_retransmits = 0;
> > @@ -3644,18 +3649,20 @@ static int tcp_ack(struct sock *sk, const
> > struct sk_buff *skb, int flag)
> >  	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked,
> >  				    &sack_state);
> >
> > +	if (tp->tlp_high_seq)
> > +		tcp_process_tlp_ack(sk, ack, flag);
> > +	/* If needed, reset TLP/RTO timer; RACK may later override this. */
> [Mao Wenan] I have question about RACK, if there is no RACK feature in lower
> version, who can clear this flag:FLAG_SET_XMIT_TIMER?
> 
> > +	if (flag & FLAG_SET_XMIT_TIMER)
> > +		tcp_set_xmit_timer(sk);
> > +
> >  	if (tcp_ack_is_dubious(sk, flag)) {
> >  		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED |
> FLAG_NOT_DUP));
> >  		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
> >  	}
> > -	if (tp->tlp_high_seq)
> > -		tcp_process_tlp_ack(sk, ack, flag);
> >
> >  	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
> >  		sk_dst_confirm(sk);
> >
> > -	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
> > -		tcp_schedule_loss_probe(sk);
> >  	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
> >  	lost = tp->lost - lost;			/* freshly marked lost */
> >  	tcp_rate_gen(sk, delivered, lost, sack_state.rate); diff --git
> > a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index
> > 0ae6b5d176c0..c99cba897b9c 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2380,21 +2380,12 @@ bool tcp_schedule_loss_probe(struct sock *sk)
> >  	u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3);
> >  	u32 timeout, rto_delta_us;
> >
> > -	/* No consecutive loss probes. */
> > -	if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) {
> > -		tcp_rearm_rto(sk);
> > -		return false;
> > -	}
[Mao Wenan]Follow previous mail, in lower version such as 3.10, I found
there are many timer type, e.g:ICSK_TIME_EARLY_RETRANS, RTO,PTO
are used. I'm not sure there exist some unknown problem if we don't check 
isck_pending here and below if branch? And could you please post one lower
version patch such as 3.10?  Thanks a lot. 
 
#define ICSK_TIME_RETRANS	1	/* Retransmit timer */
#define ICSK_TIME_DACK		2	/* Delayed ack timer */
#define ICSK_TIME_PROBE0	3	/* Zero window probe timer */
#define ICSK_TIME_EARLY_RETRANS 4	/* Early retransmit timer */
#define ICSK_TIME_LOSS_PROBE	5	/* Tail loss probe timer */

> >  	/* Don't do any loss probe on a Fast Open connection before 3WHS
> >  	 * finishes.
> >  	 */
> >  	if (tp->fastopen_rsk)
> >  		return false;
> >
> > -	/* TLP is only scheduled when next timer event is RTO. */
> > -	if (icsk->icsk_pending != ICSK_TIME_RETRANS)
> > -		return false;
> > -
> >  	/* Schedule a loss probe in 2*RTT for SACK capable connections
> >  	 * in Open state, that are either limited by cwnd or application.
> >  	 */
> > --
> > 2.14.0.rc0.400.g1c36432dff-goog