netdev - Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAK6E8=ed3Lt4seZNv+veaVT-U893FVHqWGWhzwfEXyeb-yAkhA@mail.gmail.com>
Date:   Mon, 26 Mar 2018 19:07:12 -0700
From:   Yuchung Cheng <ycheng@...gle.com>
To:     Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Cc:     netdev <netdev@...r.kernel.org>,
        Neal Cardwell <ncardwell@...gle.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Sergei Shtylyov <sergei.shtylyov@...entembedded.com>
Subject: Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc
 modules also in recovery

On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@...sinki.fi> wrote:
>
> A miscalculation for the number of acknowledged packets occurs during
> RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> any non-retransmitted skbs. The reason is that pkts_acked value
> calculated in tcp_clean_rtx_queue is not correct for slow start after
> RTO as it may include segments that were not lost and therefore did
> not need retransmissions in the slow start following the RTO. Then
> tcp_slow_start will add the excess into cwnd bloating it and
> triggering a burst.
>
> Instead, we want to pass only the number of retransmitted segments
> that were covered by the cumulative ACK (and potentially newly sent
> data segments too if the cumulative ACK covers that far).
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
> ---
>  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..4a26c09 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (tcp_is_reno(tp)) {
> +                       /* Due to discontinuity on RTO in the artificial
> +                        * sacked_out calculations, TCP must restrict
> +                        * pkts_acked without SACK to rexmits and new data
> +                        * segments
> +                        */
> +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> +                               pkts_acked = rexmit_acked + newdata_acked;
> +
My understanding is there are two problems

1) your fix: the reordering logic in tcp-remove_reno_sacks requires
precise cumulatively acked count, not newly acked count?


2) current code: pkts_acked can substantially over-estimate the newly
delivered pkts in both SACK and non-SACK cases. For example, let's say
99/100 packets are already sacked, and the next ACK acks 100 pkts.
pkts_acked == 100 but really only one packet is delivered. It's wrong
to inform congestion control that 100 packets have just delivered.
AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
the newly delivered packets.

A better fix for both SACK and non-SACK, seems to be moving
ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
calibrated? this is what BBR is currently doing to avoid these pitfalls.


>                         tcp_remove_reno_sacks(sk, pkts_acked);
>                 } else {
>                         int delta;
> --
> 2.7.4
>