netdev - Re: Invalid timestamp? causing tight ack loop (hundreds of thousands of packets / sec)


Message-ID: <CAO-X30tWuhwTPd6dJPc9eap0HFSw2+aH_hejO59bdTU9W=c4Fw@mail.gmail.com>
Date:	Wed, 4 Feb 2015 00:37:46 -0800
From:	Avery Fay <avery@...panel.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org, Neal Cardwell <ncardwell@...gle.com>
Subject: Re: Invalid timestamp? causing tight ack loop (hundreds of thousands
 of packets / sec)

Sorry, forgot to mention: the packet capture has been filtered to only
the ip/port combo from above. I also stopped the capture early before
the traffic would naturally stop because I wanted to disable
timestamps again as soon as possible.

Avery

On Wed, Feb 4, 2015 at 12:35 AM, Avery Fay <avery@...panel.com> wrote:
> Sure, https://dl.dropboxusercontent.com/u/9777748/loop.pcap.gz
>
> Also, no idea if it helps, but here's the traceroute:
>
> HOST: apibalancer-wdc-05                          Loss%   Snt   Last   Avg  Best  Wrst StDev
>   1.|-- 184.173.130.1-static.reverse.softlayer.com   0.0%    10    0.2   1.9   0.2   9.6   3.1
>   2.|-- 208.43.118.164-static.reverse.softlayer.com  0.0%    10    0.2   0.2   0.2   0.3   0.0
>   3.|-- ae8.bbr02.eq01.wdc02.networklayer.com        0.0%    10    1.2   1.3   1.1   2.6   0.5
>   4.|-- ash-b1-link.telia.net                        0.0%    10    1.2   4.3   1.1  12.7   5.0
>   5.|-- ash-bb3-link.telia.net                       0.0%    10    1.2   2.6   1.1  15.0   4.4
>   6.|-- atl-bb1-link.telia.net                       0.0%    10   13.4  13.3  13.3  13.4   0.0
>   7.|-- 213.248.94.220                               0.0%    10   14.2  14.2  14.2  14.4   0.1
>   8.|-- 130.207.254.6                                0.0%    10   14.2  14.2  14.1  14.3   0.1
>     |  `|-- 130.207.254.185
>   9.|-- gateway2-rtr.gatech.edu                      0.0%    10   14.2  14.3  14.1  14.8   0.2
>     |  `|-- 143.215.254.97
>  10.|-- 143.215.254.97                               0.0%    10   14.6  14.6  14.4  14.7   0.1
>     |  `|-- 143.215.253.114
>  11.|-- 143.215.253.114                              0.0%    10   15.1  14.7  14.5  15.1   0.2
>  12.|-- ???                                         100.0    10    0.0   0.0   0.0   0.0   0.0
>
> On Wed, Feb 4, 2015 at 12:03 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> On Tue, 2015-02-03 at 22:50 -0800, Avery Fay wrote:
>>> Hello,
>>>
>>> Let me say first: if there's a better place to ask this, please point
>>> me in that direction.
>>>
>>> We've been having huge packets / sec spikes in the past few days.
>>> After some investigation, it looks like single connections are getting
>>> stuck in a loop (see tcpdump below). Each "stuck" connection will
>>> generate about 200kpps. It looks like our side is rejecting packets
>>> with "packets rejects in established connections because of timestamp"
>>> from netstat -s (internally PAWSEstab counter) and then generating an
>>> additional packet that we send out. All of these connections originate
>>> from Georgia Tech, but so far (not completely verified) it doesn't
>>> seem like there's any pattern to the client/os other than the fact
>>> that they're trying to make an https request to us.
>>>
>>> As a temporary countermeasure, we've disabled net.ipv4.tcp_timestamps,
>>> which solves the immediate problem.
>>>
>>> Our server is 174.36.240.86 running Ubuntu 12.04 with kernel 3.13.0-35-generic
>>>
>>> The client is 128.61.57.205 and in this case almost certainly has user
>>> agent (we found successful requests 10 seconds before the tcpdump with
>>> same ip): Dalvik/2.1.0 (Linux; U; Android 5.0; XT1095
>>> Build/LXE22.46-11)
>>>
>>> Beginning of tcpdump:
>>
>> ...
>>
>>>
>>> At this point, it just repeats until some timeout is hit. I haven't
>>> timed it, but probably one or two minutes.
>>>
>>> I guess I have a few questions:
>>>
>>> 1.) What's going on here? It looks like maybe there's some packet loss
>>> and then connection termination gets stuck in a loop because the
>>> client timestamp went down?
>>> 2.) Is there a better way to mitigate this other than disabling
>>> tcp_timestamps or blocking Georgia Tech IPs?
>>> 3.) Is this our problem (ok, obviously our problem since we're
>>> affected but...), a kernel problem, or a gatech problem?
>>>
>>> I'd really appreciate any help on this,
>>
>> Would you have a pcap file instead?
>>
>> It looks like a middlebox is broken; I don't think Android could
>> possibly send a frame with no payload but with the Push flag set.
>>
>> Neal has some patches that add rate limiting on DACKs, which we might
>> upstream (per-socket rate limiting of 2 DACKs per second).
>>
>> Thanks
>>
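The timestamp rejection described in the original report can be illustrated with a minimal sketch of a PAWS-style check, written along the lines of RFC 7323 and not taken from the Linux source. The function names (`ts_before`, `paws_reject`, `handle_segment`) are hypothetical, and the sketch assumes the simplest rule only: a segment whose TSval is older than the last accepted one is dropped and answered with an ACK. The real rules have exceptions (RST segments, long-idle connections) that are omitted here.

```python
MOD = 1 << 32  # TCP timestamps are 32-bit values that wrap around


def ts_before(a: int, b: int) -> bool:
    """Return True if 32-bit timestamp a is strictly older than b.

    The comparison is modular, so it stays correct across wraparound.
    """
    diff = (a - b) % MOD
    return diff >= (1 << 31)  # equivalent to signed 32-bit (a - b) < 0


def paws_reject(seg_tsval: int, ts_recent: int) -> bool:
    """Simplified PAWS test: reject when the segment's TSval went backwards."""
    return ts_before(seg_tsval, ts_recent)


def handle_segment(seg_tsval: int, ts_recent: int) -> str:
    """Hypothetical receiver step: a rejected segment still triggers an ACK."""
    if paws_reject(seg_tsval, ts_recent):
        return "drop+ack"
    return "accept"
```

If the peer keeps sending segments that fail this test, every one of them produces a reply, which matches the behaviour reported above where each rejected packet generates an additional outgoing packet.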
>>
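The PAWSEstab counter mentioned in the report is exposed on Linux in /proc/net/netstat, which consists of pairs of lines: a header line of counter names followed by a line of values, both starting with the same prefix such as `TcpExt:`. A small sketch of a parser for that layout follows. The function name `parse_netstat` and the `SAMPLE` text are made up for illustration, and the sample values are not real measurements.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"Prefix.Name": value}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[f"{prefix}.{name}"] = int(num)
    return counters


# Illustrative input only; a real file has many more columns.
SAMPLE = """\
TcpExt: SyncookiesSent PAWSEstab
TcpExt: 0 12345
IpExt: InOctets
IpExt: 99
"""

paws_estab = parse_netstat(SAMPLE)["TcpExt.PAWSEstab"]
```

On a live system the same function could be fed `open("/proc/net/netstat").read()`, and sampling the counter twice a second apart would give a rough rejection rate to compare against the observed packet rate.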
>>
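The per-socket limit of 2 per second mentioned in the reply can be sketched as a fixed-window counter. This is only an illustration of the idea, not the patch referred to above: the class name `AckRateLimiter`, its parameters, and the fixed-window policy are all assumptions. Time is passed in explicitly so the behaviour is deterministic.

```python
class AckRateLimiter:
    """Allow at most `limit` replies per `window` seconds for one socket."""

    def __init__(self, limit: int = 2, window: float = 1.0):
        self.limit = limit
        self.window = window
        self.window_start = None  # start time of the current window
        self.count = 0            # replies already sent in this window

    def allow(self, now: float) -> bool:
        # Open a new window on first use or once the old one has expired.
        if self.window_start is None or now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


# Simulate one second of a stuck connection at roughly 200,000 packets/sec.
limiter = AckRateLimiter()
sent = sum(limiter.allow(i / 200_000) for i in range(200_000))
```

With such a limit in place, a loop like the one in the report would still persist until its timeout, but each affected socket would emit only a couple of replies per second rather than hundreds of thousands.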