Date:	Tue, 6 Jan 2015 19:50:41 +0000
From:	Erik Grinaker <erik@...gler.no>
To:	Rick Jones <rick.jones2@...com>
Cc:	Yuchung Cheng <ycheng@...gle.com>,
	Eric Dumazet <eric.dumazet@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	netdev <netdev@...r.kernel.org>
Subject: Re: TCP connection issues against Amazon S3

On 06 Jan 2015, at 19:16, Rick Jones <rick.jones2@...com> wrote:
> 
>>>>>>> A packet dump [1] shows repeated ACK retransmits for some of the
>> TCP does not retransmit ACK ... do you mean DUPACKs sent by the receiver?
>> 
>> I am trying to understand the problem. Could you confirm that it's the
>> HTTP responses sent from Amazon S3 got stalled, or HTTP requests sent
>> from the receiver (your host)?
>> 
>> btw I suspect some middleboxes are stripping SACKOK options from your
>> SYNs (or Amazon SYN-ACKs) assuming Amazon supports SACK.
> 
> The TCP Timestamp option too it seems.
> 
> Speaking of middleboxes...  It is probably a red herring, but a while back I ran into a middlebox (a load balancer) which decided that if it saw "too many" retransmissions in a given TCP window, something was seriously wrong, and it would toast the connection.  As I recall, though, that was an active reset on the part of the middlebox.  (And the client was the active sender, not the back-end server.)

It’s looking increasingly probable that it’s something like that, since the sender (S3) appears to disable SACK for the failing clients while enabling it for other, working clients.
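One way to check the option-stripping theory against a capture directly is to compare the options carried on the outgoing SYN with those on the returning SYN-ACK: if SAckOK (or the Timestamp option Rick mentions) shows up on one side but not the other, something on the path is removing it. A minimal sketch using scapy; the filename is a placeholder for whatever pcap you point it at:

#!/usr/bin/env python3
# Sketch: report which TCP options each SYN / SYN-ACK in a capture
# carries, to spot a middlebox stripping SAckOK or Timestamp.
# Placeholder filename; requires scapy (pip install scapy).
from scapy.all import rdpcap, TCP, IP

for pkt in rdpcap("tcp-issues-s3-failure.pcap"):
    if TCP not in pkt or IP not in pkt:
        continue
    flags = int(pkt[TCP].flags)
    if not flags & 0x02:
        continue                  # options are negotiated on SYN / SYN-ACK only
    kind = "SYN-ACK" if flags & 0x10 else "SYN"
    names = {opt[0] for opt in pkt[TCP].options}
    print("%s %s:%d -> %s:%d  SAckOK=%s Timestamp=%s" % (
        kind, pkt[IP].src, pkt[TCP].sport,
        pkt[IP].dst, pkt[TCP].dport,
        "SAckOK" in names, "Timestamp" in names))

If the client's SYN shows SAckOK=True but the SYN-ACK from S3 comes back without it (or vice versa on a capture taken at the other end), that points at the path rather than at either endpoint.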

> I'm assuming one incident starts at XX:41:24.748265 in the trace?  That does look like it is slowly slogging its way through a bunch of lost traffic, which I think was part of the problem I was seeing with the middlebox I ran into, though I don't see the reset where I would have expected it.  Still, it looks like the sender's TCP RTO keeps increasing as it works through the slog (as it likely must, since there are no TCP timestamps?), to the point where it gets larger than curl was apparently willing to wait; hence the FIN at XX:41:53.269534 after a gap of ten seconds or so.

Yes, there is one incident, starting at XX:41:23. All the RSTs are sent at the end, though, at the 30s curl timeout. I’ve put up a stripped-down pcap of a single request here:

http://abstrakt.bengler.no/tcp-issues-s3-failure.pcap.bz2
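Rick's reading, that the sender's RTO is growing during the slog, is also consistent with classic exponential backoff: each retry waits roughly twice as long as the last, so assuming a starting RTO near the 200ms Linux minimum, the gaps run about 0.2s, 0.4s, 0.8s, 1.6s, 3.2s, 6.4s, 12.8s, passing ten seconds on the seventh try. That can be eyeballed from the capture by listing retransmitted data segments and the gaps between successive tries. A rough sketch, same caveats as above (scapy, placeholder filename, no filtering of keepalives or padding-only segments, so treat the output as a hint rather than proof):

#!/usr/bin/env python3
# Sketch: list data-segment retransmissions and the gap since the
# previous try, to see whether the sender's RTO is doubling.
from scapy.all import rdpcap, TCP, IP

first = {}   # (src, sport, dst, dport, seq) -> time of first transmission
prev = {}    # same key -> time of most recent (re)transmission

for pkt in rdpcap("tcp-issues-s3-failure.pcap"):
    if TCP not in pkt or IP not in pkt:
        continue
    if len(bytes(pkt[TCP].payload)) == 0:
        continue                  # only data segments get retransmitted this way
    key = (pkt[IP].src, pkt[TCP].sport,
           pkt[IP].dst, pkt[TCP].dport, pkt[TCP].seq)
    t = float(pkt.time)
    if key in first:
        print("%.6f  retransmit seq=%u  gap since last try: %.3fs"
              % (t, pkt[TCP].seq, t - prev[key]))
    else:
        first[key] = t
    prev[key] = t

A run of retransmissions of the same sequence number with roughly doubling gaps would match the slog Rick describes, and a final gap longer than curl's patience would explain the FIN landing where it does.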


