netdev - Re: tcp: multiple ssthresh reductions before all packets are retransmitted

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Mon, 16 Jun 2014 10:02:08 -0700
From:	Yuchung Cheng <ycheng@...gle.com>
To:	Michal Kubecek <mkubecek@...e.cz>
Cc:	netdev <netdev@...r.kernel.org>
Subject: Re: tcp: multiple ssthresh reductions before all packets are retransmitted

On Mon, Jun 16, 2014 at 8:35 AM, Michal Kubecek <mkubecek@...e.cz> wrote:
> Hello,
>
> while investigating a problem with TCP loss recovery, I noticed that if
> large window is lost and another loss happens before the recovery is
> finished, ssthresh can drop to very low values. This is especially
> harmful with Reno congestion control where cwnd growth in congestion
> avoidance phase is linear and rather slow. This is 100% reproducible
> on older (3.0) kernels but I was able to reproduce it on 3.15 as well.
>
> RFC 5681 says that ssthresh reduction in response to RTO should be done
> only once and should not be repeated until all packets from the first
> loss are retransmitted. RFC 6582 (as well as its predecessor RFC 3782)
> is even more specific and says that when loss is detected, one should
> mark current SND.NXT and ssthresh shouldn't be reduced again due to
> a loss until SND.UNA reaches this remembered value.
>
> Linux implementation does exactly that but in TCP_CA_Loss state,
> tcp_enter_loss() also takes icsk_retransmits into account:
>
>         /* Reduce ssthresh if it has not yet been made inside this window. */
>         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
>             !after(tp->high_seq, tp->snd_una) ||
>             (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
>                 new_recovery = true;
>                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
>                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
>                 tcp_ca_event(sk, CA_EVENT_LOSS);
>         }
>
> This seems correct as icsk_retransmits is supposed to mean "we still
> have packets to retransmit". But icsk_retransmits is reset in
> tcp_process_loss() if one of two conditions is satisfied:

icsk_retransmits indicates the number of recurring timeouts (of the
same sequence). so it is reset when the recovery is done or SND_UNA is
advanced. the variable name however is confusing.

does your suggested change fix the problem you are observing?

>
>         bool recovered = !before(tp->snd_una, tp->high_seq);
> ...
>         if (recovered) {
>                 /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
>                 icsk->icsk_retransmits = 0;
>                 tcp_try_undo_recovery(sk);
>                 return;
>         }
>         if (flag & FLAG_DATA_ACKED)
>                 icsk->icsk_retransmits = 0;
>
> First part implements the condition from RFC. But the second part can
> reset icsk_retransmits much earlier than all previously lost data are
> retransmitted so that ssthresh can be reduced more than once.
>
> The first is recent and comes from commits ab42d9ee3 ("tcp: refactor
> CA_Loss state processing") and e33099f96 ("tcp: implement RFC5682
> F-RTO"). Second test is much older and predates the start of git tree.
>
> I believe this code is actually wrong and we should remove those two
> lines:
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 40661fc..f534723 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2710,8 +2710,6 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
>                 tcp_try_undo_recovery(sk);
>                 return;
>         }
> -       if (flag & FLAG_DATA_ACKED)
> -               icsk->icsk_retransmits = 0;
>         if (tcp_is_reno(tp)) {
>                 /* A Reno DUPACK means new data in F-RTO step 2.b above are
>                  * delivered. Lower inflight to clock out (re)tranmissions.
>
> But as I don't know where do they come from, I can't be sure they don't
> serve some purpose so I wanted to ask first if someone knows the logic
> behind them.
>
> Thanks in advance,
> Michal Kubecek
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html