netdev - Re: TCP fast retransmit issues


Message-ID: <CADVnQym9JpC+vDVPVjP0ibhPQu3NhxsynRYA--FNzAgKJUJQSg@mail.gmail.com>
Date:   Fri, 28 Jul 2017 18:54:41 -0400
From:   Neal Cardwell <ncardwell@...gle.com>
To:     Willy Tarreau <w@....eu>
Cc:     Eric Dumazet <eric.dumazet@...il.com>, Klavs Klavsen <kl@...n.dk>,
        Netdev <netdev@...r.kernel.org>,
        Yuchung Cheng <ycheng@...gle.com>,
        Nandita Dukkipati <nanditad@...gle.com>
Subject: Re: TCP fast retransmit issues

On Wed, Jul 26, 2017 at 3:02 PM, Neal Cardwell <ncardwell@...gle.com> wrote:
> On Wed, Jul 26, 2017 at 2:38 PM, Neal Cardwell <ncardwell@...gle.com> wrote:
>> Yeah, it looks like I can reproduce this issue with (1) bad SACKs
>> causing repeated TLPs, and (2) TLP timers being pushed out to later
>> times due to incoming data. Scripts are attached.
>
> I'm testing a fix of only scheduling a TLP if (flag & FLAG_DATA_ACKED)
> is true...

An update for the TLP aspect of this thread: our team has a proposed
fix for this RTO/TLP reschedule issue that we have reviewed internally
and tested with our packetdrill test suite, including some new tests.
The basic approach in the fix is as follows:

a) only reschedule the xmit timer once per ACK

b) only reschedule the xmit timer if tcp_clean_rtx_queue() deems this
safe (a packet was cumulatively ACKed, or we got a SACK for a
packet that was sent before the most recent retransmit of the write
queue head).
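
For illustration, rules (a) and (b) above can be sketched as a small
standalone C model. This is not the proposed patch and not kernel code:
the struct, the field names, the SK_FLAG_* bits and both functions are
hypothetical names invented for this sketch, and it assumes send times
are comparable monotonic timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not the kernel's definitions. */
#define SK_FLAG_CUM_ACKED      0x1u /* a packet was cumulatively ACKed */
#define SK_FLAG_SET_XMIT_TIMER 0x2u /* safe to reschedule the xmit timer */

/* Hypothetical summary of what one incoming ACK told the sender. */
struct ack_info {
    bool     cum_acked_packet;      /* ACK advanced past at least one packet */
    bool     sacked_packet;         /* ACK newly SACKed at least one packet */
    uint64_t sacked_sent_time;      /* send time of the newly SACKed packet */
    uint64_t head_last_rexmit_time; /* most recent retransmit of queue head */
};

/*
 * Models rule (b): decide whether rescheduling is safe. It is safe if a
 * packet was cumulatively ACKed, or if the SACKed packet was sent before
 * the most recent retransmit of the write queue head.
 */
static unsigned int clean_rtx_queue_model(const struct ack_info *a)
{
    unsigned int flag = 0;

    if (a->cum_acked_packet)
        flag |= SK_FLAG_CUM_ACKED | SK_FLAG_SET_XMIT_TIMER;
    else if (a->sacked_packet &&
             a->sacked_sent_time < a->head_last_rexmit_time)
        flag |= SK_FLAG_SET_XMIT_TIMER;

    return flag;
}

/*
 * Models rule (a): called once per ACK, with a single decision point,
 * so the timer is rearmed at most once per ACK. Returns 1 if the timer
 * would be rescheduled and counts the rearm in *rearm_count.
 */
static int process_ack_model(const struct ack_info *a, int *rearm_count)
{
    unsigned int flag;

    assert(a && rearm_count);
    flag = clean_rtx_queue_model(a);

    if (flag & SK_FLAG_SET_XMIT_TIMER) {
        (*rearm_count)++;
        return 1;
    }
    return 0;
}
```

In this model, a SACK for a packet sent after the head's most recent
retransmit leaves the timer alone, so such ACKs cannot keep pushing
the timer out to later times.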

After further review and testing we will post it, hopefully next week.

thanks,
neal