Date:   Tue, 27 Mar 2018 08:06:52 -0700
From:   Yuchung Cheng <ycheng@...gle.com>
To:     Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc:     netdev <netdev@...r.kernel.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
 modules also in recovery

On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
<ilpo.jarvinen@...sinki.fi> wrote:
> On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> <ilpo.jarvinen@...sinki.fi> wrote:
>> >
>> > A miscalculation for the number of acknowledged packets occurs during
>> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> > any non-retransmitted skbs. The reason is that pkts_acked value
>> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> > RTO as it may include segments that were not lost and therefore did
>> > not need retransmissions in the slow start following the RTO. Then
>> > tcp_slow_start will add the excess into cwnd, bloating it and
>> > triggering a burst.
>> >
>> > Instead, we want to pass only the number of retransmitted segments
>> > that were covered by the cumulative ACK (and potentially newly sent
>> > data segments too if the cumulative ACK covers that far).
>> >
>> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
>> > ---
>> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>> >  1 file changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> > index 9a1b3c1..4a26c09 100644
>> > --- a/net/ipv4/tcp_input.c
>> > +++ b/net/ipv4/tcp_input.c
>> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >         long seq_rtt_us = -1L;
>> >         long ca_rtt_us = -1L;
>> >         u32 pkts_acked = 0;
>> > +       u32 rexmit_acked = 0;
>> > +       u32 newdata_acked = 0;
>> >         u32 last_in_flight = 0;
>> >         bool rtt_update;
>> >         int flag = 0;
>> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
>> > -                       if (sacked & TCPCB_SACKED_RETRANS)
>> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
>> >                                 tp->retrans_out -= acked_pcount;
>> > +                               rexmit_acked += acked_pcount;
>> > +                       }
>> >                         flag |= FLAG_RETRANS_DATA_ACKED;
>> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>> >                         last_ackt = skb->skb_mstamp;
>> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                                 reord = start_seq;
>> >                         if (!after(scb->end_seq, tp->high_seq))
>> >                                 flag |= FLAG_ORIG_SACK_ACKED;
>> > +                       else
>> > +                               newdata_acked += acked_pcount;
>> >                 }
>> >
>> >                 if (sacked & TCPCB_SACKED_ACKED) {
>> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (tcp_is_reno(tp)) {
>> > +                       /* Due to discontinuity on RTO in the artificial
>> > +                        * sacked_out calculations, TCP must restrict
>> > +                        * pkts_acked without SACK to rexmits and new data
>> > +                        * segments
>> > +                        */
>> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
>> > +                               pkts_acked = rexmit_acked + newdata_acked;
>> > +
>> My understanding is that there are two problems:
>>
>> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires the
>> precise cumulatively acked count, not the newly acked count?
>
> While I'm not entirely sure whether you intended to say that my fix is
> broken or not, I thought about this very difference a lot while making
> the fix, and I believe that this fix is needed because of the
> discontinuity at RTO (sacked_out is cleared as we set L-bits + lost_out).
> This is an artifact in the imitation of sacked_out for non-SACK, but at
> RTO we can't keep that in sync because we set L-bits (and have no S-bits
> to guide us). Thus, we can no longer "use" those skbs with only the
> L-bit for the reno_sacks logic.
>
> In tcp_remove_reno_sacks, acked - sacked_out is used to calculate
> tp->delivered; using the plain cumulative acked count causes congestion
> control breakage later, as the call to tcp_cong_control will directly
> use the difference in tp->delivered.
>
> This boils down to the exact definition of tp->delivered (the one given
> in the header is not detailed enough). I guess you might have a better
> idea of what exactly it is, since one of you added it? There are subtle
> things in the definition that can make it entirely unsuitable for cc
> decisions. Should those segments that we (possibly) already counted into
> tp->delivered during a (potentially preceding) CA_Recovery be added to
> it a _second time_ or not? This fix avoids such double counting (the
Where is the double counting, assuming normal DUPACK behavior?

In the non-SACK case:

1. upon receiving a DUPACK, we assume one packet has been delivered, by
incrementing tp->delivered in tcp_add_reno_sack()
2. upon receiving a partial ACK or an ACK that acks the recovery point
(high_seq), tp->delivered is incremented by (cumulatively acked -
#dupacks) in tcp_remove_reno_sacks()

therefore tp->delivered tracks the # of packets delivered (sacked,
acked, DUPACK'd) with the most information it could have inferred.
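
To make that bookkeeping concrete, here is a minimal user-space C model
of the two steps above (a simplification of what tcp_add_reno_sack() /
tcp_remove_reno_sacks() do, not the kernel code itself):

#include <stdio.h>

struct model {
	unsigned int delivered;  /* models tp->delivered */
	unsigned int sacked_out; /* models the DUPACK-imitated tp->sacked_out */
};

/* step 1: each DUPACK is assumed to mean one packet reached the receiver */
static void on_dupack(struct model *tp)
{
	tp->sacked_out++;
	tp->delivered++;
}

/* step 2: a cumulative ACK delivers (acked - #dupacks) additional packets,
 * because the DUPACK'd ones were already counted in step 1
 */
static void on_cumulative_ack(struct model *tp, unsigned int acked)
{
	tp->delivered += acked - tp->sacked_out;
	tp->sacked_out = 0;
}

int main(void)
{
	struct model tp = { 0, 0 };

	on_dupack(&tp);             /* three DUPACKs arrive ... */
	on_dupack(&tp);
	on_dupack(&tp);
	on_cumulative_ack(&tp, 10); /* ... then an ACK covering 10 packets */

	printf("delivered = %u\n", tp.delivered); /* 3 + (10 - 3) = 10 */
	return 0;
}

With normal DUPACK behavior each packet ends up counted exactly once.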

From congestion control's perspective, it cares about the delivery
information (e.g. how much), not the sequences (what or how).

I am pointing out that:

1) your fix may fix one corner-case CC packet accounting issue in the
non-SACK CA_Loss case
2) but it does not fix the other (major) CC packet accounting issue in
the SACK case
3) it breaks the dynamic dupthresh / reordering detection in the
non-SACK case, as tcp_check_reno_reordering requires strictly the
cumulatively acked count.

Therefore I prefer
a) using tp->delivered to fix (1) and (2) (sketched below)
b) perhaps improving the tp->delivered SACK-emulation code in the
non-SACK case
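
To make (a) concrete, a hedged sketch of the flow I have in mind (this
is roughly what tcp_ack() already does to feed BBR via
tcp_cong_control(); illustrative, not a patch):

	u32 delivered = tp->delivered;  /* snapshot before processing this ACK */

	/* ... tcp_clean_rtx_queue() and the reno-sack bookkeeping run ... */

	delivered = tp->delivered - delivered;  /* freshly ACKed or SACKed */
	/* feed this calibrated count, not raw pkts_acked, to the CC module */

Moving ca_ops->pkts_acked next to this computation would hand it the
same calibrated count.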


> price is that we might underestimate). For SACK+ACK losses, the similar
> question is: should those segments that we missed counting into
> tp->delivered during the preceding CA_Recovery (due to losing enough
> SACKs) be added into tp->delivered now during RTO recovery or not? (I'm
> not proposing we fix this unless we want to fix both issues at the same
> time here, as its impact with SACK is not that significant.) Is
> tp->delivered supposed to under- or overestimate (in the cases where
> we're not sure what/when something happened)? ...If it's overestimating
> under any circumstances (for the current ACK), we cannot base our cc
> decision on it.
>
> tcp_check_reno_reordering might like to have the cumulatively acked
> count, but due to the aforementioned discontinuity we cannot accurately
> provide in CA_Loss what tcp_limit_reno_sacked expects anyway (and the
> other CA states are unaffected by this fix). While we could call
> tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it wouldn't
> remove the fundamental discontinuity problem that we have for what
> tcp_limit_reno_sacked assumes about sacked_out. It might actually be
> that tcp_limit_reno_sacked is just never going to work after we zero
> tp->sacked_out (this would actually be problem #3), or at least it ends
> up underestimating the reordering degree by the amount we cleared from
> sacked_out at RTO?
>
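
For reference, tcp_limit_reno_sacked() (net/ipv4/tcp_input.c; abridged
from memory, so treat this as a sketch) only caps the imitated
sacked_out against packets_out:

	static bool tcp_limit_reno_sacked(struct tcp_sock *tp)
	{
		u32 holes;

		holes = max(tp->lost_out, 1U);
		holes = min(holes, tp->packets_out);

		if ((tp->sacked_out + holes) > tp->packets_out) {
			tp->sacked_out = tp->packets_out - holes;
			return true;
		}
		return false;
	}

so once sacked_out has been zeroed at RTO it indeed has little left to
detect reordering with.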
>> 2) current code: pkts_acked can substantially over-estimate the newly
>> delivered pkts in both the SACK and non-SACK cases. For example, let's
>> say 99/100 packets are already sacked, and the next ACK acks 100 pkts.
>> pkts_acked == 100, but really only one packet is delivered. It's wrong
>> to inform congestion control that 100 packets have just been delivered.
>> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
>> the newly delivered packets.
>>
>> A better fix for both SACK and non-SACK seems to be moving
>> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
>> calibrated? This is what BBR is currently doing to avoid these pitfalls.
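
(Worked through with delivered-based accounting for the example above:
the 99 SACKs raise tp->delivered to 99, and the final cumulative ACK
raises it to 100, so the delta correctly reports 1 newly delivered
packet, while pkts_acked reports the full 100.)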
>
> Unfortunately that would not fix the problematic calculation of
> tp->delivered.
>
> But you might be right that there's a problem #2 (it's hard to notice
> in practice, though, as most of the cc modules don't seem to use it for
> anything). However, this is about pkts_acked, so are you sure what the
> cc modules actually expect: the shrinkage in the # of outstanding
> packets, or the number of newly delivered packets?
>
> Looking (only) quickly into the modules (that use it outside of CA_Open):
> - Somewhat suspicious delayed-ACK detection logic in cdg
> - HTCP might want the shrinkage in the # of outstanding packets (but I'm
> not entirely sure) for throughput measurement
> - I guess TCP Illinois is broken (it assumes it's the newly delivered
> packets), but it would not need to use it at all, as it just proxies the
> value via a local variable into tcp_illinois_cong_avoid, which has the
> correct acked readily available.
>
> There are separate cong_avoid and cong_control callbacks that are more
> obviously oriented toward doing cc, whereas I'd think the pkts_acked
> callback is more oriented toward RTT-sample-related operations.
> Therefore, I'm not entirely sure this is what is wanted for pkts_acked,
> especially given the HTCP example.
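
For reference, the split in question, paraphrased from struct
tcp_congestion_ops in include/net/tcp.h (comments mine):

	/* ACK/RTT-sample hook; gets pkts_acked and rtt_us per ACK */
	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
	/* classic cwnd-growth hook (Reno/CUBIC style) */
	void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
	/* full congestion-control hook (BBR style), fed the rate sample */
	void (*cong_control)(struct sock *sk, const struct rate_sample *rs);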
>
>
> --
>  i.
