Message-ID: <CAK6E8=ffjLxqYuQf0FMge9XjNbrFvPzE9NvM5ahJxhxAwZwsOA@mail.gmail.com>
Date: Wed, 28 Mar 2018 08:04:54 -0700
From: Yuchung Cheng <ycheng@...gle.com>
To: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc: netdev <netdev@...r.kernel.org>,
Neal Cardwell <ncardwell@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
modules also in recovery
On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ycheng@...gle.com> wrote:
>
> On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> <ilpo.jarvinen@...sinki.fi> wrote:
> > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> >> <ilpo.jarvinen@...sinki.fi> wrote:
> >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >> >
> >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> >> <ilpo.jarvinen@...sinki.fi> wrote:
> >> >> >
> >> >> > A miscalculation for the number of acknowledged packets occurs during
> >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> >> > RTO as it may include segments that were not lost and therefore did
> >> >> > not need retransmissions in the slow start following the RTO. Then
> >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> >> >> > triggering a burst.
> >> >> >
> >> >> > Instead, we want to pass only the number of retransmitted segments
> >> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> >> > data segments too if the cumulative ACK covers that far).
> >> >> >
> >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
> >> >> > ---
> >> >>
> >> >> My understanding is there are two problems
> >> >>
> >> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
> >> >> the precise cumulatively acked count, not the newly acked count?
> >> >
> >> > While I'm not entirely sure whether you intended to say that my fix is
> >> > broken or not, I thought about this very difference a lot while making
> >> > the fix and I believe that this fix is needed because of the
> >> > discontinuity at RTO (sacked_out is cleared as we set L-bits +
> >> > lost_out). This is an artifact of the imitation of sacked_out for
> >> > non-SACK, but at RTO we can't keep it in sync because we set L-bits
> >> > (and have no S-bits to guide us). Thus, we can no longer "use" those
> >> > skbs with only the L-bit for the reno_sacks logic.
> >> >
> >> > In tcp_remove_reno_sacks, acked - sacked_out is used to calculate
> >> > tp->delivered; using the plain cumulatively acked count causes
> >> > congestion control breakage later because the call to tcp_cong_control
> >> > directly uses the difference in tp->delivered.
> >> >
> >> > This boils down to the exact definition of tp->delivered (the one given
> >> > in the header is not detailed enough). I guess you might have a better
> >> > idea of what it exactly is since one of you added it? There are subtle
> >> > things in the definition that can make it entirely unsuitable for cc
> >> > decisions. Should those segments that we (possibly) already counted into
> >> > tp->delivered during the (potentially preceding) CA_Recovery be added to
> >> > it a _second time_ or not? This fix avoids such double counting (the
> >> Where is the double counting, assuming normal DUPACK behavior?
> >>
> >> In the non-sack case:
> >>
> >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> >> incrementing tp->delivered in tcp_add_reno_sack()
> >
> > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > I've tried to point out quite many times already)...
> >
> >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> >> #dupacks) in tcp_remove_reno_sacks()
> >
> > ...and this won't happen correctly anymore after RTO (since non-SACK
> > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > cumulatively acked - 0 to tp->delivered on those ACKs.
> >
> >> therefore tp->delivered is tracking the # of packets delivered
> >> (sacked, acked, DUPACK'd) with the most information it could have
> >> inferred.
> >
> > Since you didn't answer any of my questions about tp->delivered directly,
> > let me rephrase them to this example (non-SACK, of course):
> >
> > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> > three segments. How much should tp->delivered be incremented? 2 or 3?
> >
> > ...I think 2 is the right answer.
> >
> >> From congestion control's perspective, it cares about the delivery
> >> information (e.g. how much), not the sequences (what or how).
> >
> > I guess you must have missed my point. I'm talking about "how much" the
> > whole time. It's just about when we can account for it (and when not).
> >
> >> I am pointing out that
> >>
> >> 1) your fix may fix one corner CC packet accounting issue in the
> >> non-SACK and CA_Loss
> >> 2) but it does not fix the other (major) CC packet accounting issue in
> >> the SACK case
> >
> > You say major, but it's major if and only if one is using one of the
> > affected cc modules, which were not that many and IMHO not very mainstream
> > anyway (check my list from the previous email; not that I'd mind if they
> > got fixed too). The rest of the cc modules do not seem to use the
> > incorrect value even if some of them have the pkts_acked callback.
> >
> > Other CC packet accounting (regardless of SACK) is based on tp->delivered,
> > and tp->delivered is NOT similarly miscounted with SACK because the logic
> > depends on !TCPCB_SACKED_ACKED (which fails only if we have very high ACK
> > loss).
> >
> >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> >> as tcp_check_reno_reordering requires strictly cum. ack.
> >
> > I think that the current non-SACK reordering detection logic is not that
> > sound after RTO (but I'm yet to prove this). ...There seem to be some
> > failure modes, which is why I even thought of disabling the whole thing
> > for non-SACK RTO recoveries, and as it now seems you care more about
> > non-SACK than you initially claimed, I might even have motivation to fix
> > more of those corners rather than the worst bugs only ;-). ...But I'll
> > make only the tp->delivered fix in this patch to avoid any change to this
> > part of the code.
> >
> >> Therefore I prefer
> >> a) using tp->delivered to fix (1)(2)
> >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
> >
> > Somehow I get an impression that you might assume/say here that
> [resending in plaintext]
> That's a wrong impression. Perhaps it's worth re-iterating what I agree
> and disagree with
>
> 1. [agree] there's accounting issue in non-SACK as you discovered
> which causes CC misbehavior
>
> 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
> => that field is not used by common C.C. (you said so too)
>
> 3. [disagree] adjusting pkts_acked may not affect reordering
> accounting in non-sack
>
>
> For cubic or reno, the main source is the "acked_sacked" passed into
> tcp_cong_avoid(). That variable is derived from tp->delivered.
> Therefore we need to fix that to address the problem in (1).
>
> I have yet to read your code. Will read later today.
Your patch looks good. Some questions / comments:
1. Why apply this only to CA_Loss and not also to CA_Recovery? That may
mitigate the GRO issue I noted in the other thread.
2. Minor style nit: can we adjust tp->delivered directly in
tcp_clean_rtx_queue()? I am personally against calling the non-SACK code
"reno" (e.g. tcp_*reno*()) because it's really confusing w/ the Reno
congestion control when SACK is used. One day I'd like to rename all these
*reno* to _nonsack_.
Thanks
>
> > tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> > Therefore a) won't help at all for non-SACK. tp->delivered for non-SACK is
> > currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect
> > pkts_acked is fed to it. ...Thus, b) is an intermediate step in the
> > miscalculation chain I'm fixing with this change. b) also resolves (1)
> > without additional changes to the logic!
> >
> > I've included below the updated change that only fixes the tp->delivered
> > calculation (not tested beyond compiling). ...But I think it's worse than
> > my previous version because tcp_remove_reno_sacks then assumes
> > non-sensical L|S skbs (but there seem to be other, worse limitations in
> > the sacked_out imitation after RTO; because all skbs are marked with the
> > L-bit, it's not that big a deal for me, as most of that imitation code is
> > a no-op anyway after RTO).
> >
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..e11748d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >
> > if (acked > 0) {
> > /* One ACK acked hole. The rest eat duplicate ACKs. */
> > - tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> > if (acked - 1 >= tp->sacked_out)
> > tp->sacked_out = 0;
> > else
> > @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> > tcp_verify_left_out(tp);
> > }
> >
> > +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> > +{
> > + if (acked > 0)
> > + tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> > +}
> > +
> > static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
> > {
> > tp->sacked_out = 0;
> > @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > long seq_rtt_us = -1L;
> > long ca_rtt_us = -1L;
> > u32 pkts_acked = 0;
> > + u32 rexmit_acked = 0;
> > + u32 newdata_acked = 0;
> > u32 last_in_flight = 0;
> > bool rtt_update;
> > int flag = 0;
> > @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > }
> >
> > if (unlikely(sacked & TCPCB_RETRANS)) {
> > - if (sacked & TCPCB_SACKED_RETRANS)
> > + if (sacked & TCPCB_SACKED_RETRANS) {
> > tp->retrans_out -= acked_pcount;
> > + rexmit_acked += acked_pcount;
> > + }
> > flag |= FLAG_RETRANS_DATA_ACKED;
> > } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> > last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > reord = start_seq;
> > if (!after(scb->end_seq, tp->high_seq))
> > flag |= FLAG_ORIG_SACK_ACKED;
> > + else
> > + newdata_acked += acked_pcount;
> > }
> >
> > if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > }
> >
> > if (tcp_is_reno(tp)) {
> > + tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> > + pkts_acked :
> > + rexmit_acked + newdata_acked);
> > tcp_remove_reno_sacks(sk, pkts_acked);
> > } else {
> > int delta;