netdev - TCP sockets stalling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-id: <4B0CB1B8.8030402@shiftmail.org>
Date:	Wed, 25 Nov 2009 05:25:28 +0100
From:	Asdo <asdo@...ftmail.org>
To:	netdev@...r.kernel.org, e1000-devel@...ts.sourceforge.net
Subject: TCP sockets stalling - help! (long)

Hi netdev and e1000 people,

I have a weird problem here.

We have a Linux "Storage" server acting as a Samba, SCP and SFTP 
fileserver. All clients are Windows: XP and Vista, 32 and 64 bits. The 
building has gigabit network. We transfer up to many TB at a time.

I have tried from many clients to the same Server and there is always 
the same problem: after hours that a transfer is running the TCP socket 
suddenly stalls: 0 bytes/sec. It doesn't slow down, it stalls suddenly. 
It never disconnects though.

 From that moment onward, the TCP socket stays in a stalling condition 
forever (I saw myself at least 10 hours) and both sides see it 
Established forever. This is in case of no other networking activity on 
the same network card.
In case of other networking activity someone told me that he did see the 
transfer restart (=continue) by itself (it was an SFTP that time). I 
think I saw this phenomenon myself once during the tcpdump at the bottom 
of this email.

This problem happens on every protocol (Samba, SCP, SFTP) and apparently 
any client (for sure I tested on at least 1 Vista 64bit and 1 XP 32bit).

Also happens in the middle of transferring one file, so not necessarily 
at file change. The files are being pushed from the clients to the linux 
Server.

The gigabit switch between the machines was changed 2 times (the brands 
also were different) without results.

I was able to obtain apparently reliable nonstop transfer if I performed 
6 simultaneous transfers with Samba from the same client to the same 
server. OTOH with just 1 - 3 simultaneous transfers only all of them 
will eventually stall. Note that with 2 transfers the speed of each one 
is already smaller than the speed of a single transfer (approx 
40+40MB/sec vs 60MB/sec) however with 6 transfers it's obviously much 
smaller (approx 6x12MB/sec). The bug looks like some kind of a race 
condition to me.

On Samba another strange fact is that if there are multiple transfers 
(i.e. 2) from the same client to the same server (please note that in 
this case with Samba all transfers pass through the same TCP socket) the 
two transfers stall *independently* one to the other. First after a 
random time you see one stalling, and eventually after another random 
time also the other one stalls. This tells that the socket actually 
kinda works... I can even stop one stalled transfer from the client (the 
stop operation *works* and gets a reply from the Samba server: I see 
messages sniffing the Ethernet), and start it again and it would run, 
and the TCP socket is always the same when I restart the transfer.

I obtained apparently reliable transfers with Samba if I introduced rate 
limiting on our SMC 8024L2 gigabit switch . Even if the rate limiting 
was set so high (930Mbps) that the transfer speed was actually not 
influenced, the Samba transfer (1 single transfer, not multiple) 
apparently worked reliably. Maybe this helps with the race condition by 
separating packets better. The SFTP protocol hanged anyway though, even 
on rate limiting, however consider that SFTP from windows goes very slow 
at 15Mbps due to a problem of encryption performance of WinSCP, so that 
one is really way lower than the rate limit and its packets might not 
get separated.

The server is a linux Ubuntu 9.10 with kernel ubuntu 2.6.31-14-server . 
The ethernets are two integrated Intels (I tested both ethernet ports):
Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet 
Controller (Copper) (rev 01)
the driver is kernel's e1000

This is the stack trace of server processes (Server side) during an SFTP 
stall:

root       327  [CUT] \_ sshd: myuser [priv]
1011       380  [CUT]     \_ sshd: myuser
1011       382  [CUT]         \_ /usr/lib/openssh/sftp-server

root@...torage:/proc# cat /proc/327/stack   (this is sshd)
[<ffffffff814b428a>] unix_stream_data_wait+0xaa/0x110
[<ffffffff814b50cd>] unix_stream_recvmsg+0x36d/0x570
[<ffffffff81426399>] sock_aio_read+0x149/0x150
[<ffffffff8111e232>] do_sync_read+0xf2/0x130
[<ffffffff8111e8e1>] vfs_read+0x181/0x1a0
[<ffffffff8111edec>] sys_read+0x4c/0x80
[<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

root@...torage:/proc# cat /proc/380/stack   (this is the second sshd)
[<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
[<ffffffff8112ef4f>] do_select+0x58f/0x6b0
[<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
[<ffffffff8112fc32>] sys_select+0x42/0x110
[<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

root@...torage:/proc# cat /proc/382/stack (this is sftp-server)
[<ffffffff8112e644>] poll_schedule_timeout+0x34/0x50
[<ffffffff8112ef4f>] do_select+0x58f/0x6b0
[<ffffffff8112f8b5>] core_sys_select+0x185/0x2b0
[<ffffffff8112fc32>] sys_select+0x42/0x110
[<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

so everything seems normal, it just seems the socket is waiting for data...

I am really puzzled because... I don't know how the Samba protocol 
works, but for SFTP which I believe is a simple protocol, I believe the 
client is just pushing raw bytes onto a simple TCP socket so it should 
keep doing that while it receives Acks. OTOH if it does not receive Acks 
the socket would disconnect. However in my case during a stall the 
socket stays Established (I checked at BOTH sides!) but the client stops 
pushing bytes to the server. (That's for a stall in the middle of a file 
on SFTP protocol, which *does* happen) How on earth can that be??  
(update: see tcpdump below, it might explain)

I am also thinking at a hardware problem. The server is running on a 
Tyan mainboard S5396WA2NRF and we had another problem with this computer 
(mobo?), apparently hardware related, with 3ware disk controllers, same 
problem as this guy also running another Tyan:
http://www.webhostingtalk.com/archive/index.php/t-449670.html
which I thought might be reconducible to a lost interrupt. Even on a 
lost interrupt I can't understand how to solve the Ack puzzle... this is 
not a TOE network card!

This is what I did try up to now (and didn't help):
   
Changed some Samba parameters trying various values (clearly not useful 
for SFTP btw). I ended up with a pretty common configuration, which 
anyway didn't help:
    socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=8192 
SO_SNDBUF=8192
    read raw = yes
    write raw = yes
    kernel oplocks = yes
    max xmit = 2048
    getwd cache = yes


Changed the following parameters in /proc/sys/net:
    1 -> 0 > ipv6/bindv6only (for Samba but it was bound already and I 
couldn't reboot. SFTP tests were done in IPv4 though)
    1 -> 0 > tcp_sack
    1 -> 0 > tcp_dsack
    0 -> 1 > tcp_low_latency
    2 -> 0 > tcp_frto
    cubic -> reno > tcp_congestion_control
    3 -> 6 > tcp_retries1
    3 -> 10 > tcp_tso_win_divisor
    1 -> 0 > tcp_slow_start_after_idle


Then I disabled the following offloads using ethtool:
    rx-checksumming on --> off
    tx-checksumming on --> off
    scatter-gather on --> off
    tcp-segmentation-offload: on --> off
    generic-segmentation-offload: on --> off

Nothing helped.


This is the SFTP socket when it's running (Server side):
# cat /proc/net/tcp
   2: 0F12A8C0:0016 2612A8C0:C0C6 01 00000000:00000000 02:000AB698 
00000000     0        0 267816 4 ffff88084b1e6180 24 4 1 11 -1

Later on, when it has been stuck for a while:
# cat /proc/net/tcp
   2: 0F12A8C0:0016 2612A8C0:C0C6 01 00000000:00000000 02:0002AC6B 
00000000     0        0 267816 2 ffff88084b1e6180 27 4 1 21 -1
(numbers reference: 
http://www.mjmwired.net/kernel/Documentation/networking/proc_net_tcp.txt )

as you can see from the last values, the timeout and window have 
enlarged a bit, but nothing dramatic. Apparently they are very similar.

This is another SFTP stall after I disabled the offloading via ethtool: 
(different transfer, different socket from above)
  2: 0F12A8C0:0016 2512A8C0:09BC 01 00000000:00000000 02:000602F4 
00000000     0        0 5301100 2 ffff88084ac02080 28 4 0 13 -1
so the values are similar again, apart the ack.pingpong which has become 
0   (something to do with offloading?)


This night I decided to actually tcpdump some traffic at the server 
side. This dump has a chance of being incorrect/misleading because the 
client I could use was already transferring other stuff to the server 
via Samba. I used SFTP protocol for this test but due to the 
simultaneous Samba transfer I could not obtain a long SFTP hangup. The 
longest hangup I obtained is 1 minute. I hope this is really a 
manifestation of the bug in question, and not a simple temporary 
starvation at server side.

Here is the relevant excerpt:
00:54:36.303031 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 12309129:12309229, ack 997119732, win 65535, length 100
00:54:36.303216 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 997119732:997120040, ack 12309229, win 64871, length 308
00:54:36.303450 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 12309229:12309297, ack 997120040, win 65535, length 68
00:54:36.303566 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 997120040:997120228, ack 12309297, win 64803, length 188
00:54:36.303645 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 12309297:12309365, ack 997120228, win 65535, length 68
00:54:36.304414 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 997120228:997120432, ack 12309365, win 64735, length 204
00:54:36.345583 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
997120432, win 65535, length 0       *** PROBLEM IS HERE ***
00:55:27.171687 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 12309365:12309425, ack 997120432, win 65535, length 60
00:55:27.342682 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 997120432:997120944, ack 12309425, win 64675, length 512
00:55:27.342726 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
997120944, win 65535, length 0
00:55:27.342775 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [.], seq 
997120944:997122404, ack 12309425, win 64675, length 1460
00:55:27.342787 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
997122404, win 65535, length 0
00:55:27.342792 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 997122404:997122612, ack 12309425, win 64675, length 208
00:55:27.342795 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
997122612, win 65535, length 0

It seems to me the TCP stack at the Server returns an ack to the client 
but waits 1 minutes before being able to give the further 60 bytes to 
the client (these are application-generated bytes, right?) which are 
probably needed at client-side for the functioning of the SFTP protocol.
I suspect the SFTP-server application receives the last client data just 
at the end of the minute of delay (at 00:55:27.1something)
Probably if the dump was made without concomitant Samba activity, the 
sftp-server application would have never received such last client bytes 
and that's why the hangup lasts forever.
However the TCP socket remains established because the TCP stack works 
and acks are sent properly
Could this be a good interpretation of the bug?

I don't know what are those 60 bytes though, can you confirm is it a 
part of the SFTP or SSH protocol?

I observed other similar micro-hangups:
...
00:50:59.210115 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
3346661444, win 65535, length 0
00:51:14.137978 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 7757313:7757457, ack 3346661444, win 65535, length 144
...
00:52:02.495785 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 8612969:8613037, ack 3711382248, win 65535, length 68
00:52:02.507752 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 3711382248:3711382452, ack 8613037, win 65399, length 204
00:52:02.545585 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
3711382452, win 65535, length 0
00:52:14.568489 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 8613037:8613097, ack 3711382452, win 65535, length 60
00:52:14.572162 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [P.], 
seq 3711382452:3711382964, ack 8613097, win 65339, length 512
00:52:14.572184 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
3711382964, win 65535, length 0
00:52:14.572312 IP 192.168.18.37.3418 > 192.168.18.15.22: Flags [.], seq 
3711382964:3711384424, ack 8613097, win 65339, length 1460
...
00:55:30.464957 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
1040635920, win 65535, length 0
00:55:32.230205 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 12410749:12410809, ack 1040635920, win 65535, length 60
...
00:55:54.404027 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
1362376840, win 65535, length 0
00:56:08.349224 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 13157437:13157497, ack 1362376840, win 65535, length 60
...
00:56:28.494333 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [.], ack 
1424904580, win 65535, length 0
00:56:31.450352 IP 192.168.18.15.22 > 192.168.18.37.3418: Flags [P.], 
seq 13307333:13307393, ack 1424904580, win 65535, length 60

Update: actually I just found a disk performance problem at Server side 
so that the disks couldn't go faster than 70MB/sec (stripe_cache_size 
was too low and readahead was too low), and I'm thinking if this could 
be the reason for indeed triggering the bug: if there was a starvation 
of the TCP/IP stack or network card driver at the wrong time (because 
i.e. the CPUs are doing something else like accessing the disks), could 
that prevent the 60-bytes packet from ever being transmitted...? (in 
case of no other incoming ethernet packets which can give the TCPIP or 
e1000 driver a second chance to run)

Thanks for your help
Asdo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html