netdev - Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK6E8=f77S5VjSofC6v8EY2U+YQ0incyv7beKA-aP2p9c7ebUg@mail.gmail.com>
Date:   Wed, 28 Mar 2018 07:14:25 -0700
From:   Yuchung Cheng <ycheng@...gle.com>
To:     Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc:     netdev <netdev@...r.kernel.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
 modules also in recovery

On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
<ilpo.jarvinen@...sinki.fi> wrote:
> On Tue, 27 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
>> <ilpo.jarvinen@...sinki.fi> wrote:
>> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> >> <ilpo.jarvinen@...sinki.fi> wrote:
>> >> >
>> >> > A miscalculation for the number of acknowledged packets occurs during
>> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> >> > any non-retransmitted skbs. The reason is that pkts_acked value
>> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> >> > RTO as it may include segments that were not lost and therefore did
>> >> > not need retransmissions in the slow start following the RTO. Then
>> >> > tcp_slow_start will add the excess into cwnd bloating it and
>> >> > triggering a burst.
>> >> >
>> >> > Instead, we want to pass only the number of retransmitted segments
>> >> > that were covered by the cumulative ACK (and potentially newly sent
>> >> > data segments too if the cumulative ACK covers that far).
>> >> >
>> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
>> >> > ---
>> >>
>> >> My understanding is there are two problems
>> >>
>> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
>> >> precise cumulatively acked count, not newly acked count?
>> >
>> > While I'm not entirely sure if you intented to say that my fix is broken
>> > or not, I thought this very difference alot while making the fix and I
>> > believe that this fix is needed because of the discontinuity at RTO
>> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
>> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
>> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
>> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
>> >
>> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
>> > tp->delivered, using plain cumulative acked causes congestion control
>> > breakage later as call to tcp_cong_control will directly use the
>> > difference in tp->delivered.
>> >
>> > This boils down the exact definition of tp->delivered (the one given in
>> > the header is not detailed enough). I guess you might have better idea
>> > what it exactly is since one of you has added it? There are subtle things
>> > in the defination that can make it entirely unsuitable for cc decisions.
>> > Should those segments that we (possibly) already counted into
>> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
>> > for _second time_ or not? This fix avoids such double counting (the
>> Where is the double counting, assuming normal DUPACK behavior?
>>
>> In the non-sack case:
>>
>> 1. upon receiving a DUPACK, we assume one packet has been delivered by
>> incrementing tp->delivered in tcp_add_reno_sack()
>
> 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> I've tried to point out quite many times already)...
>
>> 2. upon receiving a partial ACK or an ACK that acks recovery point
>> (high_seq), tp->delivered is incremented by (cumulatively acked -
>> #dupacks) in tcp_remove_reno_sacks()
>
> ...and this won't happen correctly anymore after RTO (since non-SACK
> won't keep #dupacks due to the discontinuity). Thus we end up adding
> cumulatively acked - 0 to tp->delivered on those ACKs.
>
>> therefore tp->delivered is tracking the # of packets delivered
>> (sacked, acked, DUPACK'd) with the most information it could have
>> inferred.
>
> Since you didn't answer any of my questions about tp->delivered directly,
> let me rephrase them to this example (non-SACK, of course):
>
> 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> three segments. How much should tp->delivered be incremented? 2 or 3?
>
> ...I think 2 is the right answer.
>
>> From congestion control's perspective, it cares about the delivery
>> information (e.g. how much), not the sequences (what or how).
>
> I guess you must have missed my point. I'm talking about "how much"
> whole the time. It's just about when can we account for it (and when not).
>
>> I am pointing out that
>>
>> 1) your fix may fix one corner CC packet accounting issue in the
>> non-SACK and CA_Loss
>> 2) but it does not fix the other (major) CC packet accounting issue in
>> the SACK case
>
> You say major but it's major if and only if one is using one of the
> affected cc modules which were not that many and IMHO not very mainstream
> anyway (check my list from the previous email, not that I'd mind if they'd
> get fixed too). The rest of the cc modules do not seem to use the incorrect
> value even if some of them have the pkts_acked callback.
>
> Other CC packet accounting (regardless of SACK) is based on tp->delivered
> and tp->delivered is NOT similarly miscounted with SACK because the logic
> depend on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> loss).
>
>> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
>> as tcp_check_reno_reordering requires strictly cum. ack.
>
> I think that the current non-SACK reordering detection logic is not that
> sound after RTO (but I'm yet to prove this). ...There seems to be some
> failure modes which is why I even thought of disabling the whole thing
> for non-SACK RTO recoveries and as it now seems you do care more about
> non-SACK than you initial claimed, I might even have motivation to fix
> more those corners rather than the worst bugs only ;-). ...But I'll make
> the tp->delivered fix only in this patch to avoid any change this part of
> the code.
>
>> Therefore I prefer
>> a) using tp->delivered to fix (1)(2)
>> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
>
> Somehow I get an impression that you might assume/say here that
[resending in plaintext]
That's wrong impression. Perhaps it's worth re-iterating what I agree
and disagree

1. [agree] there's accounting issue in non-SACK as you discovered
which causes CC misbehavior

2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
    => that field is not used by common C.C. (you said so too)

3. [disagree] adjusting pkts_acked may not affect reordering
accounting in non-sack


For cubic or reno, the main source is the "acked_sacked" passed into
tcp_cong_avoid(). that variable is derived from tp->delivered.
Therefore we need to fix that to address the problem in (1)

I have yet to read your code. Will read later today.

> tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> Therefore a) won't help any for non-SACK. Tp->delivered for non-SACK is
> currently (mis!)calculated in tcp_remove_reno_sacks because the incorrect
> pkts_acked is fed to it. ...Thus, b) is an intermediate step in the
> miscalculation chain I'm fixing with this change. B) resolves also (1)
> without additional changes to logic!
>
> I've included below the updated change that only fixes tp->delivered
> calculation (not tested besides compiling). ...But I think it's worse than
> my previous version because tcp_remove_reno_sacks then assumes
> non-sensical L|S skbs (but there seem to be other, worse limitations in
> sacked_out imitation after RTO because we've all skbs marked with L-bit so
> it's not that big deal for me as most of the that imitation code is no-op
> anyway after RTO).
>
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..e11748d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>
>         if (acked > 0) {
>                 /* One ACK acked hole. The rest eat duplicate ACKs. */
> -               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
>                 if (acked - 1 >= tp->sacked_out)
>                         tp->sacked_out = 0;
>                 else
> @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>         tcp_verify_left_out(tp);
>  }
>
> +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> +{
> +       if (acked > 0)
> +               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> +}
> +
>  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
>  {
>         tp->sacked_out = 0;
> @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (tcp_is_reno(tp)) {
> +                       tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> +                                                     pkts_acked :
> +                                                     rexmit_acked + newdata_acked);
>                         tcp_remove_reno_sacks(sk, pkts_acked);
>                 } else {
>                         int delta;