Date:   Fri, 6 Apr 2018 17:03:45 +0200
From:   Michal Kubecek <mkubecek@...e.cz>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     netdev@...r.kernel.org
Subject: Re: TCP one-by-one acking - RFC interpretation question

On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> > Hello,
> > 
> > I encountered a strange behaviour of some (non-Linux) TCP stack which
> > I believe is incorrect, but the support engineers from the company
> > producing it claim it is OK.
> > 
> > Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS-sized
> > segments, but segments 2, 4 and 6 do not reach the server (receiver):
> > 
> >          ACK             SAK             SAK             SAK
> >       +-------+-------+-------+-------+-------+-------+-------+
> >       |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> >       +-------+-------+-------+-------+-------+-------+-------+
> >     34273   35701   37129   38557   39985   41413   42841   44269
> > 
> > When segment 2 is retransmitted after the RTO timeout, the normal
> > response would be to ACK up to the end of segment 3 (ack 38557) with
> > SACK blocks for segments 5 and 7 (39985-41413 and 42841-44269).
> > 
> > However, this server stack responds with two separate ACKs:
> > 
> >   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >   - ACK 38557, SACK 39985-41413 42841-44269
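
For concreteness, a small Python sketch of the arithmetic above (the
segment boundaries are the ones from the trace, the receiver model is
just a plain RFC 2018 style merge, nothing specific to any particular
stack):

#!/usr/bin/env python3
# Sketch only: segment boundaries taken from the trace above.  Model of
# what a receiver that has already SACKed segments 3, 5 and 7 should
# answer once the retransmission of segment 2 arrives.

# (start, end) of segments 1..7, MSS = 1428
segs = [(34273 + i * 1428, 34273 + (i + 1) * 1428) for i in range(7)]

received = {1, 3, 5, 7}          # state before the retransmission
received.add(2)                  # retransmission of segment 2 arrives

# cumulative ACK: end of the longest contiguous run starting at segment 1
ack = segs[0][1]                 # segment 1 itself was received and ACKed
n = 2
while n in received:
    ack = segs[n - 1][1]
    n += 1

# SACK blocks: contiguous runs of received data above the cumulative ACK
sack = []
for m in sorted(received):
    s, e = segs[m - 1]
    if e <= ack:
        continue
    if sack and sack[-1][1] == s:
        sack[-1][1] = e
    else:
        sack.append([s, e])

print("expected: ACK", ack, "SACK", sack)
# -> expected: ACK 38557 SACK [[39985, 41413], [42841, 44269]]

This matches the second of the two ACKs above; it is the extra first
one, with a SACK block starting right at its own acknowledgment number,
that looks odd.
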
> 
> Hmmm... Yes this seems very very wrong and lazy.
> 
> Have you verified the behavior of more recent Linux kernels against such
> threats?

No, unfortunately the problem has only been encountered by our customer
in a production environment (they tried to reproduce it in a test lab,
but with no luck). They are running backups to an NFS server and it
happens from time to time (on the order of hours, IIUC). So it would
probably be hard to get them to try a more recent kernel.

On the other hand, they reported that SLE11 clients (kernel 3.0) do not
run into this kind of problem. It was originally reported as a
regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
kernel), and the problem was described as "SLE12-SP2 is ignoring
dupacks" (which seems to be mostly caused by the switch to RACK).

It also seems that part of the problem is a specific packet loss pattern
where, at some point, many packets are lost in an "every second packet"
pattern. The customer has finally started to investigate this and it
seems to have something to do with their bonding setup (they provided no
details; my guess is that packets are divided over two paths and one of
them fails).

> A packetdrill test would be relatively easy to write.

I'll try, but I have very little experience with writing packetdrill
scripts, so it will probably take some time.
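
In the meantime, just to have the on-wire form written down, the two
ACKs from the capture look roughly like this when expressed with scapy
(a sketch only: addresses, ports, window and the server's own sequence
number are made up, the acknowledgment numbers and SACK blocks are the
ones quoted above, and the timestamp option is omitted):

#!/usr/bin/env python3
# Rough sketch, not a reproducer: the two anomalous ACKs expressed with
# scapy.  Nothing is sent on the wire; the packets are only built and
# dumped so their structure is explicit.
from scapy.all import IP, TCP

common = dict(sport=2049, dport=50000, flags="A", seq=1, window=65535)

# first ACK: ack is 37129, yet 37129-38557 is *also* reported as SACKed
ack1 = IP(src="192.0.2.2", dst="192.0.2.1") / TCP(
    ack=37129,
    options=[("NOP", None), ("NOP", None),
             ("SAck", (37129, 38557, 39985, 41413, 42841, 44269))],
    **common)

# second ACK: the expected one, cumulative ack 38557, SACK for 5 and 7
ack2 = IP(src="192.0.2.2", dst="192.0.2.1") / TCP(
    ack=38557,
    options=[("NOP", None), ("NOP", None),
             ("SAck", (39985, 41413, 42841, 44269))],
    **common)

for p in (ack1, ack2):
    p.show2()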

> Regardless of this broken alien stack, we might be able to work around
> this faster than the vendor is able to fix and deploy a new stack.
> 
> ( https://en.wikipedia.org/wiki/Robustness_principle )
> Be conservative in what you do, be liberal in what you accept from
> others...

I was thinking about this a bit. "Fixing" the acknowledgment number
could do the trick, but it doesn't feel correct. We might use the fact
that the TSecr of both ACKs above matches the TSval of the
retransmission which triggered them, so that the RTT calculated from the
timestamp would be the right one. So perhaps something like "prefer the
timestamp RTT if the measured RTT seems way off". But I'm not sure
whether that could break other use cases where the (high) measured RTT
is actually correct, rather than the (low) timestamp RTT.
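
To make that a bit more concrete, the idea is roughly the following
(plain Python sketch; the function name and the factor-of-8 threshold
are made up, this is not how the actual RTT sampling code is
structured):

# Sketch of the idea only; names and the threshold are invented.

def pick_rtt_sample(seq_rtt_us, ts_rtt_us):
    """seq_rtt_us: RTT measured from the acknowledged sequence number
    (may be wildly inflated when the ACK was triggered by a
    retransmission, as in the case above).
    ts_rtt_us: RTT derived from the timestamp echo, i.e. now minus the
    TSval echoed back in TSecr.  Either may be None if unavailable."""
    if seq_rtt_us is None:
        return ts_rtt_us
    if ts_rtt_us is None:
        return seq_rtt_us
    # If the sequence-based sample is way off compared with the
    # timestamp-based one, assume the ACK was for an old (retransmitted)
    # segment and trust the timestamp echo instead.
    if seq_rtt_us > 8 * ts_rtt_us:
        return ts_rtt_us
    return seq_rtt_us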

Michal Kubecek
