Date:   Wed, 11 Apr 2018 12:58:37 +0200
From:   Michal Kubecek <mkubecek@...e.cz>
To:     netdev@...r.kernel.org
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Kenneth Klette Jonassen <kennetkl@....uio.no>
Subject: Re: TCP one-by-one acking - RFC interpretation question

On Fri, Apr 06, 2018 at 09:49:58AM -0700, Eric Dumazet wrote:
> Cc Neal and Yuchung if they missed this thread.
> 
> On 04/06/2018 08:03 AM, Michal Kubecek wrote:
> > On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> >>> Hello
> >>>
> >>> I encountered a strange behaviour of some (non-linux) TCP stack which
> >>> I believe is incorrect but which support engineers from the company
> >>> producing it claim is OK.
> >>>
> >>> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS-sized
> >>> segments but segments 2, 4 and 6 do not reach the server (receiver):
> >>>
> >>>          ACK             SAK             SAK             SAK
> >>>       +-------+-------+-------+-------+-------+-------+-------+
> >>>       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >>>       +-------+-------+-------+-------+-------+-------+-------+
> >>>     34273   35701   37129   38557   39985   41413   42841   44269
> >>>
> >>> When segment 2 is retransmitted after an RTO timeout, the normal response
> >>> would be to ACK up to segment 3 (ack 38557) with SACK blocks for 5 and 7
> >>> (39985-41413 and 42841-44269).
> >>>
> >>> However, this server stack responds with two separate ACKs:
> >>>
> >>>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >>>   - ACK 38557, SACK 39985-41413 42841-44269
> >>
> >> Hmmm... Yes this seems very very wrong and lazy.
> >>
> >> Have you verified the behavior of more recent Linux kernels against such threats?
> > 
> > No, unfortunately the problem was only encountered by our customer in
> > a production environment (they tried to reproduce it in a test lab with
> > no luck). They are running backups to an NFS server and it happens from
> > time to time (on the order of hours, IIUC). So it would probably be hard
> > to let them try a more recent kernel.
> > 
> > On the other hand, they reported that SLE11 clients (kernel 3.0) do not
> > run into this kind of problem. It was originally reported as a
> > regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
> > kernel) and the problem was described as "SLE12-SP2 is ignoring dupacks"
> > (which seems to be mostly caused by the switch to RACK).
> > 
> > It also seems that part of the problem is a specific packet loss pattern
> > where, at some point, many packets are lost in an "every second packet"
> > pattern. The customer finally started to investigate this and it seems it
> > has something to do with their bonding setup (they provided no details;
> > my guess is that packets are split over two paths and one of them fails).
> > 
> >> packetdrill test would be relatively easy to write.
> > 
> > I'll try but I have very little experience with writing packetdrill
> > scripts so it will probably take some time.
> > 
> >> Regardless of this broken alien stack, we might be able to work around
> >> this faster than the vendor is able to fix and deploy a new stack.
> >>
> >> ( https://en.wikipedia.org/wiki/Robustness_principle )
> >> Be conservative in what you do, be liberal in what you accept from
> >> others...
> > 
> > I was thinking about this a bit. "Fixing" the acknowledgment number
> > could do the trick but it doesn't feel correct. We might use the fact
> > that the TSecr of both ACKs above matches the TSval of the retransmission
> > which triggered them, so the RTT calculated from timestamps would be the
> > right one. So perhaps something like "prefer the timestamp RTT if the
> > measured RTT seems way too far off". But I'm not sure whether it could
> > break other use cases where the (high) measured RTT is actually correct,
> > rather than the (low) timestamp RTT.

I stared at the code some more and apparently I was wrong. I put my
tracing right after the check

        if (flag & FLAG_SACK_RENEGING) {

in tcp_check_sack_reneging() and completely missed Neal's commit
5ae344c949e7 ("tcp: reduce spurious retransmits due to transient SACK
reneging") which deals with exactly this kind of broken TCP stacks
sending a series of acks for each segment from out of order queue.
Therefore it seems the events in my log weren't actual SACK reneging
events and the SACK scoreboard wasn't cleared.
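
For reference, after that commit tcp_check_sack_reneging() no longer
flushes the scoreboard immediately; as far as I can tell from the 4.4
sources it only arms a short retransmit timer (max(RTT/2, 10ms)) and the
scoreboard is cleared only if the apparent reneging persists until that
timer fires. Roughly (simplified from my reading of 4.4, so please
double-check):

        static bool tcp_check_sack_reneging(struct sock *sk, int flag)
        {
                if (flag & FLAG_SACK_RENEGING) {
                        struct tcp_sock *tp = tcp_sk(sk);
                        /* give the receiver max(RTT/2, 10ms) to fix its
                         * scoreboard before treating this as real reneging
                         */
                        unsigned long delay = max(usecs_to_jiffies(tp->srtt_us >> 4),
                                                  msecs_to_jiffies(10));

                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                                  delay, TCP_RTO_MAX);
                        return true;
                }
                return false;
        }

That would explain why my tracing right after the FLAG_SACK_RENEGING
check fired without the scoreboard actually being flushed.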

There is something else I don't understand, though. In the case of
acking a previously sacked and never retransmitted segment,
tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
using

        if (sack->first_sackt.v64) {
                sack_rtt_us = skb_mstamp_us_delta(&now, &sack->first_sackt);
                ca_rtt_us = skb_mstamp_us_delta(&now, &sack->last_sackt);
        }

(in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
code correctly, both sack->first_sackt and sack->last_sackt contain the
timestamps of the segments' initial transmission. This would mean we use
the time difference between the initial transmission and now, i.e.
including the RTO of the lost packet.
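
For completeness, the place where these two fields are filled is
tcp_sacktag_one() (simplified from my reading of the 4.4 code, only the
relevant lines):

        /* in tcp_sacktag_one(), for a newly sacked, never retransmitted
         * skb; xmit_time is the skb's transmit timestamp
         * (skb->skb_mstamp), not the time the SACK arrived
         */
        if (state->first_sackt.v64 == 0)
                state->first_sackt = *xmit_time;
        state->last_sackt = *xmit_time;

i.e. whatever gets stored there is the original transmit time, which is
what makes the delta above include the RTO.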

IMHO we should take the actual round trip time instead, i.e. the
difference between the original transmission and the time the packet was
first sacked. It seems we were doing this before commit
31231a8a8730 ("tcp: improve RTT from SACK for CC").
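
Just to illustrate what I mean (completely untested, the
first_sack_rtt_us field is made up), the delta could be taken directly in
tcp_sacktag_one() at the moment the skb is sacked, instead of stashing
xmit_time and computing against "now" later:

        /* hypothetical sketch only: sample the RTT at the time the skb
         * is first sacked
         */
        if (state->first_sack_rtt_us < 0) {
                struct skb_mstamp sack_time;

                skb_mstamp_get(&sack_time);
                state->first_sack_rtt_us =
                        skb_mstamp_us_delta(&sack_time, xmit_time);
        }

tcp_clean_rtx_queue() would then use that value as sack_rtt_us directly
instead of computing the difference itself. I have not checked whether
this would regress anything else, though.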

Michal Kubecek
