Message-ID: <4B0FBB50.5080109@ans.pl>
Date: Fri, 27 Nov 2009 12:43:12 +0100
From: Krzysztof Olędzki <ole@....pl>
To: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
CC: Eric Dumazet <eric.dumazet@...il.com>,
David Miller <davem@...emloft.net>,
Herbert Xu <herbert@...dor.apana.org.au>,
netdev@...r.kernel.org
Subject: Re: Problem with tcp (2.6.31), http://bugzilla.kernel.org/show_bug.cgi?id=14580
On 2009-11-27 12:04, Ilpo Järvinen wrote:
> On Fri, 27 Nov 2009, Krzysztof Oledzki wrote:
>
>>
>> On Thu, 26 Nov 2009, Eric Dumazet wrote:
>>
>>> Ilpo Järvinen wrote:
>>>> On Thu, 26 Nov 2009, Eric Dumazet wrote:
>>>>
>>>>> Eric Dumazet wrote:
>>>>>> Krzysztof Olędzki wrote:
>>>>>>> On 2009-11-26 21:47, Eric Dumazet wrote:
>>>>>>>
>>>>>>>> About wscale not being sent, I suppose the latest kernel has it fixed
>>>>>>> Thanks, I'll check it. But it is quite strange that, while reusing
>>>>>>> the old connection, Linux uses wscale==0.
>>>>>>>
>>>>>> It only 'reuses' the sequence of the previous connection to compute
>>>>>> its ISN.
>>>>>>
>>>>>> It's a new socket, a new connection, with possibly different RCVBUF
>>>>>> settings -> different window.
>>>>>>
>>>>>> In my tests on net-next-2.6, I always have wscale set, but I am using
>>>>>> a program of my own, not a full NFS setup.
>>>>>>
>>>>>>
>>>>> Well, it seems NFS reuses its socket, so maybe we miss some cleanup,
>>>>> as spotted in this old patch:
>>>> ...Nice, so we have this reuse of the socket after all. ...It seems that
>>>> our other bugs might have just been solved (the wq purge can then cause
>>>> stale hints if this reusing is indeed true).
>>>>
>>> Indeed, and we can do this in user space too :)
>>>
>>>
>>> sockaddr.sin_family = AF_INET;
>>> sockaddr.sin_port = htons(PORT);
>>> sockaddr.sin_addr.s_addr = inet_addr("192.168.20.112");
>>> res = connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>>> ...
>>>
>>> /*
>>> * following code calls tcp_disconnect()
>>> */
>>> memset(&sockaddr, 0, sizeof(sockaddr));
>>> sockaddr.sin_family = AF_UNSPEC;
>>> connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>>>
>>> /* reuse socket and reconnect on same target */
>>> sockaddr.sin_family = AF_INET;
>>> sockaddr.sin_port = htons(PORT);
>>> sockaddr.sin_addr.s_addr = inet_addr("192.168.20.112");
>>> res = connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>>>
>>>
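For reference, this snippet is easy to turn into a complete test program. A
minimal, untested sketch of my own follows - port 333 and the address are
simply the values from the trace below, and error handling is omitted:

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in sockaddr;
	int fd, res;

	fd = socket(AF_INET, SOCK_STREAM, 0);

	/* first connection */
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_port = htons(333);
	sockaddr.sin_addr.s_addr = inet_addr("192.168.20.112");
	res = connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));

	/* an AF_UNSPEC connect() ends up calling tcp_disconnect() */
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_UNSPEC;
	connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));

	/* reuse the socket and reconnect to the same target */
	memset(&sockaddr, 0, sizeof(sockaddr));
	sockaddr.sin_family = AF_INET;
	sockaddr.sin_port = htons(333);
	sockaddr.sin_addr.s_addr = inet_addr("192.168.20.112");
	res = connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));

	close(fd);
	return res < 0;
}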
>>> I reproduced the problem (of a too-small window, wscale = 0 instead of 6):
>>>
>>>
>>>
>>> 23:14:53.608106 IP client.3434 > 192.168.20.112.333: S
>>> 392872616:392872616(0) win 5840 <mss 1460,nop,nop,timestamp 82516578
>>> 0,nop,wscale 6>
>>> 23:14:53.608199 IP 192.168.20.112.333 > client.3434: S
>>> 2753948468:2753948468(0) ack 392872617 win 5792 <mss 1460,nop,nop,timestamp
>>> 1660900486 82516578,nop,wscale 6>
>>> 23:14:53.608218 IP client.3434 > 192.168.20.112.333: . ack 1 win 92
>>> <nop,nop,timestamp 82516578 1660900486>
>>> 23:14:53.608232 IP client.3434 > 192.168.20.112.333: P 1:7(6) ack 1 win 92
>>> <nop,nop,timestamp 82516578 1660900486>
>>> 23:14:53.608320 IP 192.168.20.112.333 > client.3434: . ack 7 win 91
>>> <nop,nop,timestamp 1660900486 82516578>
>>> 23:14:53.608328 IP 192.168.20.112.333 > client.3434: P 1:7(6) ack 7 win 91
>>> <nop,nop,timestamp 1660900486 82516578>
>>> 23:14:53.608331 IP 192.168.20.112.333 > client.3434: F 7:7(0) ack 7 win 91
>>> <nop,nop,timestamp 1660900486 82516578>
>>> 23:14:53.608341 IP client.3434 > 192.168.20.112.333: . ack 7 win 92
>>> <nop,nop,timestamp 82516578 1660900486>
>>> 23:14:53.647202 IP client.3434 > 192.168.20.112.333: . ack 8 win 92
>>> <nop,nop,timestamp 82516618 1660900486>
>>> 23:14:56.614341 IP client.3434 > 192.168.20.112.333: F 7:7(0) ack 8 win 92
>>> <nop,nop,timestamp 82519584 1660900486>
>>> 23:14:56.614439 IP 192.168.20.112.333 > client.3434: . ack 8 win 91
>>> <nop,nop,timestamp 1660903493 82519584>
>>> 23:14:56.614461 IP client.3434 > 192.168.20.112.333: R
>>> 392872624:392872624(0) win 0
>>>
>>> <<HERE: win = 5840, wscale = 0>>
>>> 23:14:56.616260 IP client.3434 > 192.168.20.112.333: S
>>> 392878450:392878450(0) win 5840 <mss 1460,nop,nop,timestamp 82519586
>>> 0,nop,wscale 0>
>>>
>>> 23:14:56.616352 IP 192.168.20.112.333 > client.3434: S
>>> 2800950724:2800950724(0) ack 392878451 win 5792 <mss 1460,nop,nop,timestamp
>>> 1660903494 82519586,nop,wscale 6>
>>>
>>>
>>>
>>> The following patch solves this problem, but maybe we need a flag
>>> (a la sk->sk_userlocks |= SOCK_WINCLAMP_LOCK;)
>>> in case the user has set window_clamp.
>>> Or should we just document the clearing after a TCP disconnect?
>>>
>>> [PATCH] tcp: tcp_disconnect() should clear window_clamp
>>>
>>> Otherwise, reuse of a socket may select a small window (wscale = 0)
>>> for the next connection.
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 524f976..7d4648f 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -2059,6 +2059,7 @@ int tcp_disconnect(struct sock *sk, int flags)
>>>  	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
>>>  	tp->snd_cwnd_cnt = 0;
>>>  	tp->bytes_acked = 0;
>>> +	tp->window_clamp = 0;
>>>  	tcp_set_ca_state(sk, TCP_CA_Open);
>>>  	tcp_clear_retrans(tp);
>>>  	inet_csk_delack_init(sk);
>>>
>> Thanks!
>>
>> 10:07:31.429627 IP 192.168.152.205.44678 > 192.168.152.20.2049: Flags [S], seq
>> 2012792102, win 5840, options [mss 1460,sackOK,TS val 4294877898 ecr
>> 0,nop,wscale 7], length 0
>> 10:07:31.429736 IP 192.168.152.20.2049 > 192.168.152.205.44678: Flags [S.],
>> seq 1548680033, ack 2012792103, win 5792, options [mss 1460,sackOK,TS val
>> 68439846 ecr 4294877898,nop,wscale 7], length 0
>> (switching servers)
>> 10:08:05.186989 IP 192.168.152.20.2049 > 192.168.152.205.44678: Flags [R], seq
>> 1548680550, win 0, length 0
>> 10:08:11.187117 IP 192.168.152.205.44678 > 192.168.152.20.2049: Flags [S], seq
>> 2012804321, win 5840, options [mss 1460,sackOK,TS val 4294917656 ecr
>> 0,nop,wscale 7], length 0
>> 10:08:11.187276 IP 192.168.152.20.2049 > 192.168.152.205.44678: Flags [S.],
>> seq 2176044714, ack 2012804322, win 5792, options [mss 1460,sackOK,TS val
>> 68482560 ecr 4294917656,nop,wscale 7], length 0
>>
>> This indeed fixes the problem with the missing/zero wscale; however, the
>> original problem (the TCP loop flood) still remains. I wonder why the client
>> is not able to handle it, especially since the seq numbers received from the
>> two servers differ by much, much more than the current window size: 627364681
>> is much larger than 5840 << 7 (747520).
>
> What would you expect to happen? If out-of-window stuff arrives, we send
> dupacks. If we sent resets, that would open the door to blind RST attacks.
> In theory we might be able to quench the loop by using the pingpong thing,
> but that needs very careful thought in order not to introduce other
> problems, and even then your connections will not be re-usable until either
> end times out, so the gain is rather limited. We simply cannot RST the
> connection; that's not an option.
Right, the idea of sending RST is indeed stupid. But at the risk of being
silly: why do we need to send anything in response to out-of-window
packets? Especially as we do it without any ratelimiting, even for a
packet that contains no data, only a pure ACK. The current situation can
easily be abused for a hard-to-trace DoS - just send a lot of spoofed
packets to a port of an established connection, and the server will
respond at the same rate, flooding the client.
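Even a very crude global limit on such replies would remove most of the
amplification. Just to illustrate the idea, a rough, untested sketch - the
names and the 100-per-second budget are invented, this is not an existing
kernel interface:

/*
 * Sketch: allow at most ~100 replies to out-of-window segments per
 * second, globally; callers drop the reply when this returns false.
 */
static unsigned long tcp_oow_stamp;	/* start of the current interval */
static unsigned int tcp_oow_count;	/* replies sent in this interval */

static bool tcp_oow_reply_allowed(void)
{
	if (time_after(jiffies, tcp_oow_stamp + HZ)) {
		/* a new one-second interval starts: reset the budget */
		tcp_oow_stamp = jiffies;
		tcp_oow_count = 0;
	}
	return ++tcp_oow_count <= 100;
}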
> I find this problem simply stems from the introduced loss of end-to-end
> connectivity. If you really "lost" that server so that its TCP state were
> not maintained, you'd get resets etc. (crash, scheduled reboot or
> whatever).
In a split-brain situation, when your network is temporarily segmented, two
redundant servers may take the same VIP simultaneously. When the network
restores full functionality, one of the servers loses the IP. That is how
I found this problem.
> The only real solution would be a kill switch for the TCP connection when
> you break e-2-e connectivity (i.e., switch servers so that the same IP
> is reacquired by somebody else). In theory you can "simulate" the kill
> switch by setting the tcp_retries sysctls to small values to make the
> connections time out much faster, but still that might not be enough for
> you (and it has other implications you might not like).
Now I wonder - maybe we could simply kill ESTABLISHED connections bound to
an address that is being removed?
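Something along the lines of this completely untested sketch, walking the
established hash and resetting every socket bound to the disappearing
address (the exact field and helper names would need checking against the
current tree, and the locking needs more thought):

/*
 * Sketch: kill every established TCP socket bound to a local address
 * that is going away.  Untested; names approximate.
 */
static void tcp_kill_addr(__be32 saddr)
{
	unsigned int bucket;

	for (bucket = 0; bucket < tcp_hashinfo.ehash_size; bucket++) {
		spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, bucket);
		struct hlist_nulls_node *node;
		struct sock *sk;

		spin_lock_bh(lock);
		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[bucket].chain) {
			if (inet_sk(sk)->rcv_saddr != saddr)
				continue;
			/* make the failure visible to the owner ... */
			sk->sk_err = ETIMEDOUT;
			sk->sk_error_report(sk);
			/* ... and tear the connection down */
			tcp_done(sk);
		}
		spin_unlock_bh(lock);
	}
}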
>> But there is one more thing that still bugs me, please look at one of my
>> previous dump:
>>
>> 17:39:48.339503 IP (tos 0x0, ttl 64, id 31305, offset 0, flags [DF], proto TCP
>> (6), length 56)
>> 192.168.152.205.55329 > 192.168.152.20.2049: Flags [S], cksum 0x7d35
>> (correct), seq 3093379972, win 5840, options [mss 1460,sackOK,TS val 16845 ecr
>> 0], length 0
>>
>> 17:39:48.339588 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
>> (6), length 56)
>> 192.168.152.20.2049 > 192.168.152.205.55329: Flags [S.], cksum 0x7930
>> (correct), seq 4250661905, ack 3093379973, win 5792, options [mss
>> 1460,sackOK,TS val 9179690 ecr 16845], length 0
>>
>> OK, now we know that the client is buggy and sends small windows, but why
>> does the response from the server also contain such a small window?
>
> Perhaps I don't fully understand what you find here to be a problem...
> Anyway, initially we start with a small window and enlarge it as we keep
> going (receiver window auto-tuning).
Yes, Eric already explained it to me. I was just wondering why we start with
such a small window here. Normally, with wscale enabled, the window is much
higher, so without wscale it should be higher too, of course within the
16-bit (65535) limit. But as the window can grow later, it is not a problem.
Best regards,
Krzysztof Olędzki
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html