Date:   Tue, 27 Mar 2018 17:23:43 +0300 (EEST)
From:   Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
To:     Yuchung Cheng <ycheng@...gle.com>
cc:     netdev <netdev@...r.kernel.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
 modules also in recovery

On Mon, 26 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@...sinki.fi> wrote:
> >
> > A miscalculation for the number of acknowledged packets occurs during
> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > any non-retransmitted skbs. The reason is that pkts_acked value
> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > RTO as it may include segments that were not lost and therefore did
> > not need retransmissions in the slow start following the RTO. Then
> > tcp_slow_start will add the excess into cwnd bloating it and
> > triggering a burst.
> >
> > Instead, we want to pass only the number of retransmitted segments
> > that were covered by the cumulative ACK (and potentially newly sent
> > data segments too if the cumulative ACK covers that far).
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..4a26c09 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >         long seq_rtt_us = -1L;
> >         long ca_rtt_us = -1L;
> >         u32 pkts_acked = 0;
> > +       u32 rexmit_acked = 0;
> > +       u32 newdata_acked = 0;
> >         u32 last_in_flight = 0;
> >         bool rtt_update;
> >         int flag = 0;
> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
> > -                       if (sacked & TCPCB_SACKED_RETRANS)
> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
> >                                 tp->retrans_out -= acked_pcount;
> > +                               rexmit_acked += acked_pcount;
> > +                       }
> >                         flag |= FLAG_RETRANS_DATA_ACKED;
> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >                         last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 reord = start_seq;
> >                         if (!after(scb->end_seq, tp->high_seq))
> >                                 flag |= FLAG_ORIG_SACK_ACKED;
> > +                       else
> > +                               newdata_acked += acked_pcount;
> >                 }
> >
> >                 if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                 }
> >
> >                 if (tcp_is_reno(tp)) {
> > +                       /* Due to discontinuity on RTO in the artificial
> > +                        * sacked_out calculations, TCP must restrict
> > +                        * pkts_acked without SACK to rexmits and new data
> > +                        * segments
> > +                        */
> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> > +                               pkts_acked = rexmit_acked + newdata_acked;
> > +
> My understanding is there are two problems
> 
> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires
> precise cumulatively acked count, not newly acked count?

While I'm not entirely sure whether you intended to say that my fix is 
broken or not, I thought about this very difference a lot while making the 
fix, and I believe that this fix is needed because of the discontinuity at 
RTO (sacked_out is cleared as we set L-bits + lost_out). This is an 
artifact of the imitation of sacked_out for non-SACK, but at RTO we can't 
keep it in sync because we set L-bits (and have no S-bits to guide us). 
Thus, we can no longer "use" those skbs with only the L-bit for the 
reno_sacks logic.

In tcp_remove_reno_sacks, acked - sacked_out is used to calculate 
tp->delivered; using the plain cumulative acked count causes congestion 
control breakage later, as the call to tcp_cong_control will directly use 
the difference in tp->delivered.

This boils down to the exact definition of tp->delivered (the one given in 
the header is not detailed enough). I guess you might have a better idea of 
what exactly it is, since one of you added it? There are subtle things in 
the definition that can make it entirely unsuitable for cc decisions. 
Should those segments that we (possibly) already counted into 
tp->delivered during a (potentially preceding) CA_Recovery be added to it 
a _second time_ or not? This fix avoids such double counting (the price is 
that we might underestimate). For SACK+ACK losses, the similar question 
is: should those segments that we missed counting into tp->delivered 
during the preceding CA_Recovery (due to losing enough SACKs) be added 
into tp->delivered now during RTO recovery or not? (I'm not proposing we 
fix this unless we want to fix both issues at the same time here, as its 
impact with SACK is not that significant.) Is tp->delivered supposed to 
under- or overestimate (in the cases where we're not sure what/when 
something happened)? ...If it overestimates under any circumstances (for 
the current ACK), we cannot base our cc decision on it.

tcp_check_reno_reordering might like to have the cumulatively acked count, 
but due to the aforementioned discontinuity we cannot accurately provide 
in CA_Loss what tcp_limit_reno_sacked expects anyway (and the other CA 
states are unaffected by this fix). While we could call 
tcp_check_reno_reordering directly from tcp_clean_rtx_queue, that wouldn't 
remove the fundamental discontinuity problem we have with what 
tcp_limit_reno_sacked assumes about sacked_out. It might actually be that 
tcp_limit_reno_sacked is just never going to work after we zero 
tp->sacked_out (this would be problem #3), or at least that it ends up 
underestimating the reordering degree by the amount we cleared from 
sacked_out at RTO?

> 2) current code: pkts_acked can substantially over-estimate the newly
> delivered pkts in both SACK and non-SACK cases. For example, let's say
> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
> pkts_acked == 100 but really only one packet is delivered. It's wrong
> to inform congestion control that 100 packets have just delivered.
> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
> the newly delivered packets.
>
> A better fix for both SACK and non-SACK, seems to be moving
> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
> calibrated? this is what BBR is currently doing to avoid these pitfalls.

Unfortunately that would not fix the problematic calculation of 
tp->delivered.

But you might be right that there's a problem #2 (it's hard to notice in 
practice though, as most of the cc modules don't seem to use it for 
anything). However, this is about pkts_acked, so are you sure what the cc 
modules exactly expect: the shrinkage in the number of outstanding 
packets, or the number of newly delivered packets?

Looking (only) quickly into the modules (those that use it outside 
CA_Open):
- The delayed ACK detection logic in cdg looks somewhat suspicious.
- HTCP might want the shrinkage in the number of outstanding packets 
(though I'm not entirely sure) for its throughput measurement.
- I guess TCP Illinois is broken (it assumes the value is newly delivered 
packets), but it would not need to use it at all, as it just proxies the 
value through a local variable into tcp_illinois_cong_avoid, which already 
has the correct acked count readily available.

There are separate cong_avoid and cong_control callbacks that are more 
obviously oriented toward doing cc, whereas I'd think the pkts_acked 
callback is more oriented toward RTT-sample related operations. Therefore, 
I'm not entirely sure this is what is wanted for pkts_acked, especially 
given the HTCP example.


-- 
 i.
