netdev - Re: TCP one-by-one acking - RFC interpretation question

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180411120655.aib34ye5xnch4zw3@unicorn.suse.cz>
Date:   Wed, 11 Apr 2018 14:06:55 +0200
From:   Michal Kubecek <mkubecek@...e.cz>
To:     netdev@...r.kernel.org
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Kenneth Klette Jonassen <kennetkl@....uio.no>
Subject: Re: TCP one-by-one acking - RFC interpretation question

On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
> There is something else I don't understand, though. In the case of
> acking previously sacked and never retransmitted segment,
> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
> using
> 
>         if (sack->first_sackt.v64) {
>                 sack_rtt_us = skb_mstamp_us_delta(&now,
> &sack->first_sackt);
>                 ca_rtt_us = skb_mstamp_us_delta(&now,
> &sack->last_sackt);
>         }
> 
> (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
> code correctly, both sack->first_sackt and sack->last_sackt contain
> timestamps of initial segment transmission. This would mean we use the
> time difference between the initial transmission and now, i.e. including
> the RTO of the lost packet).
> 
> IMHO we should take the actual round trip time instead, i.e. the
> difference between the original transmission and the time the packet
> sacked (first time). It seems we have been doing this before commit
> 31231a8a8730 ("tcp: improve RTT from SACK for CC").

Sorry for the noise, this was my misunderstanding, the first_sackt and
last_sackt values are only taken from segments newly sacked by ack
received right now, not those which were already sacked before.

The actual problem and unrealistic RTT measurements come from another
RFC violation I didn't mention before: the NAS doesn't follow RFC 2018
section 4 rule for ordering of SACK blocks. Rather than sending SACK
blocks three most recently received out-of-order blocks, it simply sends
first three ordered by sequence numbers. In the earlier example (odd
packets were received, even lost)

       ACK             SAK             SAK             SAK
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
  34273   35701   37129   38557   39985   41413   42841   44269   45697   47125

it responds to retransmitted segment 2 by

  1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
  2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125

This new SACK block 45697-47125 has not been retransmitted and as it
wasn't sacked before, it is considered newly sacked. Therefore it gets
processed and its deemed RTT (time since its original transmit time)
"poisons" the RTT calculation, leading to RTO spiraling up.

Thus if we want to work around the NAS behaviour, we would need to
recognize such new SACK block as "not really new" and ignore it for
first_sackt/last_sackt. I'm not sure if it's possible without
misinterpreting actually delayed out of order packets. Of course, it is
not clear if it's worth the effort to work around so severely broken TCP
implementations (two obvious RFC violations, even if we don't count the
one-by-one acking).

Michal Kubecek