Date:   Wed, 28 Mar 2018 15:45:57 +0300 (EEST)
From:   Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
To:     Yuchung Cheng <ycheng@...gle.com>
cc:     netdev <netdev@...r.kernel.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
 modules also in recovery

On Tue, 27 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> <ilpo.jarvinen@...sinki.fi> wrote:
> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> <ilpo.jarvinen@...sinki.fi> wrote:
> >> >
> >> > A miscalculation for the number of acknowledged packets occurs during
> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> > RTO as it may include segments that were not lost and therefore did
> >> > not need retransmissions in the slow start following the RTO. Then
> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> > triggering a burst.
> >> >
> >> > Instead, we want to pass only the number of retransmitted segments
> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> > data segments too if the cumulative ACK covers that far).
> >> >
> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
> >> > ---
> >>
> >> My understanding is there are two problems
> >>
> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
> >> precise cumulatively acked count, not newly acked count?
> >
> > While I'm not entirely sure whether you intended to say that my fix
> > is broken or not, I thought about this very difference a lot while
> > making the fix, and I believe this fix is needed because of the
> > discontinuity at RTO (sacked_out is cleared as we set L-bits +
> > lost_out). This is an artifact of the imitation of sacked_out for
> > non-SACK, but at RTO we can't keep that in sync because we set L-bits
> > (and have no S-bits to guide us). Thus, we can no longer "use" those
> > skbs with only the L-bit for the reno_sacks logic.
> >
> > In tcp_remove_reno_sacks, acked - sacked_out is used to calculate
> > tp->delivered; using the plain cumulative acked count causes
> > congestion control breakage later, as the call to tcp_cong_control
> > will directly use the difference in tp->delivered.
> >
> > This boils down to the exact definition of tp->delivered (the one
> > given in the header is not detailed enough). I guess you might have a
> > better idea of what exactly it is, since one of you added it? There
> > are subtle things in the definition that can make it entirely
> > unsuitable for cc decisions. Should those segments that we (possibly)
> > already counted into tp->delivered during (a potentially preceding)
> > CA_Recovery be added to it for a _second time_ or not? This fix
> > avoids such double counting (the
> Where is the double counting, assuming normal DUPACK behavior?
> 
> In the non-sack case:
> 
> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> incrementing tp->delivered in tcp_add_reno_sack()

1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
I've tried to point out many times already)...
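
(Concretely, a paraphrased sketch of what RTO does to the non-SACK
state -- not verbatim kernel code, just the effect I keep referring to:)

	/* At RTO the imitated sacked_out is discarded and every
	 * outstanding skb gets the L-bit, so the #dupacks information
	 * accumulated via tcp_add_reno_sack() is lost.
	 */
	tp->sacked_out = 0;		/* tcp_reset_reno_sack() */
	tp->lost_out = tp->packets_out;	/* all skbs marked TCPCB_LOST */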

> 2. upon receiving a partial ACK or an ACK that acks recovery point
> (high_seq), tp->delivered is incremented by (cumulatively acked -
> #dupacks) in tcp_remove_reno_sacks()

...and this no longer happens correctly after RTO (since non-SACK
doesn't retain #dupacks due to the discontinuity). Thus we end up adding
cumulatively acked - 0 to tp->delivered on those ACKs.
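
(That is, with tp->sacked_out already zeroed, the existing accounting in
tcp_remove_reno_sacks() -- the line my diff below removes -- degenerates
into crediting the full cumulative count:)

	/* From tcp_remove_reno_sacks(); after RTO tp->sacked_out == 0,
	 * so this credits every cumulatively acked segment, including
	 * ones (possibly) already counted as delivered via DUPACKs.
	 */
	tp->delivered += max_t(int, acked - tp->sacked_out, 1);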

> therefore tp->delivered is tracking the # of packets delivered
> (sacked, acked, DUPACK'd) with the most information it could have
> inferred.

Since you didn't answer any of my questions about tp->delivered directly,
let me rephrase them as this example (non-SACK, of course):

4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
Cwnd = 2, so the sender rexmits 2 out of 4. We get a cumulative ACK for
three segments. How much should tp->delivered be incremented? 2 or 3?

...I think 2 is the right answer.
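
(Spelling that out: the two rexmits are the only deliveries this ACK
newly proves; seg3's original arrived out of order earlier and was
(possibly) already credited when its DUPACK was processed, so crediting
it again here would be the double count:)

	rexmitted seg1, seg2 delivered  -> tp->delivered += 2
	seg3 original: (possibly) counted already via an earlier DUPACK
	               -> adding it now would count it a second time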

> From congestion control's perspective, it cares about the delivery
> information (e.g. how much), not the sequences (what or how).

I guess you must have missed my point. I've been talking about "how much"
the whole time. It's just a question of when we can account for it (and
when not).

> I am pointing out that
> 
> 1) your fix may fix one corner CC packet accounting issue in the
> non-SACK and CA_Loss
> 2) but it does not fix the other (major) CC packet accounting issue in
> the SACK case

You say major, but it's major if and only if one is using one of the
affected cc modules, which were not that many and IMHO not very
mainstream anyway (check my list from the previous email; not that I'd
mind if they got fixed too). The rest of the cc modules do not seem to
use the incorrect value, even if some of them have the pkts_acked
callback.
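
(To be explicit about what "affected" means: only modules whose
pkts_acked callback feeds sample->pkts_acked into their cwnd/rate logic
inherit the miscount. A minimal sketch of such a consumer -- the module
and its ca struct are hypothetical; only the callback signature is the
real one from struct tcp_congestion_ops:)

	/* Hypothetical cc module: anything derived from
	 * sample->pkts_acked is off by the excess after a non-SACK RTO.
	 */
	static void example_pkts_acked(struct sock *sk,
				       const struct ack_sample *sample)
	{
		struct example_ca *ca = inet_csk_ca(sk);

		ca->acked_total += sample->pkts_acked;
	}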

Other CC packet accounting (regardless of SACK) is based on tp->delivered,
and tp->delivered is NOT similarly miscounted with SACK because the logic
depends on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
loss).
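
(Paraphrasing the relevant per-skb accounting in tcp_clean_rtx_queue()
-- a sketch, not verbatim: a segment whose S-bit is set was already
credited to tp->delivered when the SACK arrived, so the cumulative ACK
doesn't count it again:)

	if (sacked & TCPCB_SACKED_ACKED)
		tp->sacked_out -= acked_pcount; /* already credited via SACK */
	else if (tcp_is_sack(tp))
		tp->delivered += acked_pcount;  /* newly known delivered */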

> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> as tcp_check_reno_reordering requires strictly cum. ack.

I think that the current non-SACK reordering detection logic is not that
sound after RTO (but I have yet to prove this). ...There seem to be some
failure modes, which is why I even thought of disabling the whole thing
for non-SACK RTO recoveries, and as it now seems you care more about
non-SACK than you initially claimed, I might even have the motivation to
fix more of those corners rather than only the worst bugs ;-). ...But
I'll make only the tp->delivered fix in this patch, to avoid any change
to this part of the code.

> Therefore I prefer
> a) using tp->delivered to fix (1)(2)
> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case

Somehow I get the impression that you might be assuming/saying here that
tp->delivered is all correct for non-SACK? ...It isn't without a fix.
Therefore a) won't help at all for non-SACK. Tp->delivered for non-SACK
is currently (mis!)calculated in tcp_remove_reno_sacks because the
incorrect pkts_acked is fed to it. ...Thus, b) is an intermediate step
in the miscalculation chain I'm fixing with this change. b) also
resolves (1) without additional changes to the logic!

I've included below the updated change that only fixes the tp->delivered
calculation (not tested beyond compiling). ...But I think it's worse than
my previous version because tcp_remove_reno_sacks then assumes
nonsensical L|S skbs (though there seem to be other, worse limitations
in the sacked_out imitation after RTO: because all skbs are marked with
the L-bit, most of that imitation code is a no-op anyway after RTO, so
it's not that big a deal to me).


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..e11748d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 
 	if (acked > 0) {
 		/* One ACK acked hole. The rest eat duplicate ACKs. */
-		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
 		if (acked - 1 >= tp->sacked_out)
 			tp->sacked_out = 0;
 		else
@@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 	tcp_verify_left_out(tp);
 }
 
+static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
+{
+	if (acked > 0)
+		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
+}
+
 static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
 {
 	tp->sacked_out = 0;
@@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
+						      pkts_acked :
+						      rexmit_acked + newdata_acked);
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;
