Message-ID: <alpine.DEB.2.20.1803271337340.29120@whs-18.cs.helsinki.fi>
Date: Tue, 27 Mar 2018 17:23:43 +0300 (EEST)
From: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
To: Yuchung Cheng <ycheng@...gle.com>
cc: netdev <netdev@...r.kernel.org>,
Neal Cardwell <ncardwell@...gle.com>,
Eric Dumazet <edumazet@...gle.com>,
Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
modules also in recovery
On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@...sinki.fi> wrote:
> >
> > A miscalculation for the number of acknowledged packets occurs during
> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > any non-retransmitted skbs. The reason is that pkts_acked value
> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > RTO as it may include segments that were not lost and therefore did
> > not need retransmissions in the slow start following the RTO. Then
> > tcp_slow_start will add the excess into cwnd bloating it and
> > triggering a burst.
> >
> > Instead, we want to pass only the number of retransmitted segments
> > that were covered by the cumulative ACK (and potentially newly sent
> > data segments too if the cumulative ACK covers that far).
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
> > ---
> > net/ipv4/tcp_input.c | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..4a26c09 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > long seq_rtt_us = -1L;
> > long ca_rtt_us = -1L;
> > u32 pkts_acked = 0;
> > + u32 rexmit_acked = 0;
> > + u32 newdata_acked = 0;
> > u32 last_in_flight = 0;
> > bool rtt_update;
> > int flag = 0;
> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > }
> >
> > if (unlikely(sacked & TCPCB_RETRANS)) {
> > - if (sacked & TCPCB_SACKED_RETRANS)
> > + if (sacked & TCPCB_SACKED_RETRANS) {
> > tp->retrans_out -= acked_pcount;
> > + rexmit_acked += acked_pcount;
> > + }
> > flag |= FLAG_RETRANS_DATA_ACKED;
> > } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> > last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > reord = start_seq;
> > if (!after(scb->end_seq, tp->high_seq))
> > flag |= FLAG_ORIG_SACK_ACKED;
> > + else
> > + newdata_acked += acked_pcount;
> > }
> >
> > if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> > }
> >
> > if (tcp_is_reno(tp)) {
> > + /* Due to discontinuity on RTO in the artificial
> > + * sacked_out calculations, TCP must restrict
> > + * pkts_acked without SACK to rexmits and new data
> > + * segments
> > + */
> > + if (icsk->icsk_ca_state == TCP_CA_Loss)
> > + pkts_acked = rexmit_acked + newdata_acked;
> > +
> My understanding is there are two problems
>
> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> precise cumulatively acked count, not newly acked count?
While I'm not entirely sure whether you intended to say that my fix is
broken or not, I thought about this very difference a lot while making
the fix, and I believe the fix is needed because of the discontinuity
at RTO (sacked_out is cleared as we set L-bits + lost_out). This is an
artifact in the imitation of sacked_out for non-SACK, but at RTO we
cannot keep it in sync because we set L-bits (and have no S-bits to
guide us). Thus, we can no longer "use" those skbs with only the L-bit
for the reno_sacks logic.
In tcp_remove_reno_sacks, acked - sacked_out is used to calculate
tp->delivered; using the plain cumulative acked count causes congestion
control breakage later, as the call to tcp_cong_control will directly
use the difference in tp->delivered.
This boils down to the exact definition of tp->delivered (the one given
in the header is not detailed enough). I guess you might have a better
idea what it exactly is, since one of you added it? There are subtle
points in the definition that can make it entirely unsuitable for cc
decisions. Should those segments that we (possibly) already counted
into tp->delivered during a (potentially preceding) CA_Recovery be
added to it a _second time_ or not? This fix avoids such double
counting (the price is that we might underestimate). For SACK+ACK
losses, the similar question is: should those segments that we missed
counting into tp->delivered during the preceding CA_Recovery (due to
losing enough SACKs) be added into tp->delivered now during RTO
recovery or not? (I'm not proposing we fix this unless we want to fix
both issues at the same time here, as its impact with SACK is not that
significant.) Is tp->delivered supposed to under- or overestimate (in
the cases where we're not sure what/when something happened)? ...If it
overestimates under any circumstances (for the current ACK), we cannot
base our cc decision on it.
tcp_check_reno_reordering might like to have the cumulatively acked
count, but due to the aforementioned discontinuity we cannot accurately
provide in CA_Loss what tcp_limit_reno_sacked expects anyway (and the
other CA states are unaffected by this fix). While we could call
tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it
wouldn't remove the fundamental discontinuity problem we have with what
tcp_limit_reno_sacked assumes about sacked_out. It might actually be
that tcp_limit_reno_sacked is just never going to work after we zero
tp->sacked_out (this would actually be problem #3), or at least it ends
up underestimating the reordering degree by the amount we cleared from
sacked_out at RTO?
> 2) current code: pkts_acked can substantially over-estimate the newly
> delivered pkts in both SACK and non-SACK cases. For example, let's say
> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
> pkts_acked == 100 but really only one packet is delivered. It's wrong
> to inform congestion control that 100 packets have just delivered.
> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
> the newly delivered packets.
>
> A better fix for both SACK and non-SACK, seems to be moving
> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
> calibrated? this is what BBR is currently doing to avoid these pitfalls.
Unfortunately that would not fix the problematic calculation of
tp->delivered.
But you might be right that there's a problem #2 (though it's hard to
notice in practice, as most of the cc modules don't seem to use the
value for anything). However, this is about pkts_acked, so are you sure
what the cc modules actually expect: the shrinkage in the number of
outstanding packets, or the number of newly delivered packets?
Looking (only) quickly into the modules (that use it other than in
CA_Open):
- Somewhat suspicious delayed-ACK detection logic in cdg
- HTCP might want the shrinkage in the number of outstanding packets
  (but I'm not entirely sure) for its throughput measurement
- I guess TCP Illinois is broken (it assumes the value is newly
  delivered packets), but it would not need to use it at all, as it
  just proxies the value in a local variable into
  tcp_illinois_cong_avoid, which has the correct acked readily
  available.
There are separate cong_avoid and cong_control callbacks that are more
obviously oriented toward doing cc, whereas I'd think the pkts_acked
callback is more oriented toward RTT-sample related operations.
Therefore, I'm not entirely sure this is what is wanted for pkts_acked,
especially given the HTCP example.
--
i.