netdev - Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


Message-ID: <a01a16b50805310725x4f1c375ep2d091bb044ac0787@mail.gmail.com>
Date:	Sat, 31 May 2008 16:25:47 +0200
From:	"Håkon Løvdal" <hlovdal@...il.com>
To:	linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Cc:	"Ingo Molnar" <mingo@...e.hu>,
	"David S. Miller" <davem@...emloft.net>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	"Andrew Morton" <akpm@...ux-foundation.org>,
	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
Subject: Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

2008/5/28 Peter Zijlstra <peterz@...radead.org>:
> Just a quick note to say, me too!!
>
> same scenario: distcc on localhost.

Me too, however with a completely different scenario; my hung connections
are not related to distcc at all. The output from /proc/net/tcp that Ingo
posted a few days ago is somewhat different from mine, but I believe
this is the same problem or at least a related one. Just as Ingo experienced,
netstat -p only shows PID/program as '-' for the hung connections, while
for other connections it shows the expected results.

I have recently bought a new PC and have started the process of copying
stuff from my old PC to the new PC. During this I have experienced this
hang several times. I started copying by using tar on both ends over an ssh
pipe, but in order to eliminate possible ssh problems I have also tried tar
over a ttcp connection, which also fails. There is no obvious pattern to
when this happens; I have experienced failures after transferring 1.15GB,
51.4GB and 23.6GB.

Here is the output from netstat -n -o filtered for port 22 and slightly
edited. All the lines started with Proto == tcp and Recv-Q == 0.

Send-Q Local Addr Foreign Addr  State       Timer
     0 old_pc:22  new_pc:52667  ESTABLISHED keepalive (3513.93/0/0)
     0 old_pc:22  new_pc:43825  ESTABLISHED keepalive (5467.38/0/0)
  2896 old_pc:22  new_pc:58601  ESTABLISHED on (21020884.65/0/0)
  4344 old_pc:22  new_pc:54105  ESTABLISHED on (21017016.33/0/0)
  2896 old_pc:22  new_pc:34149  ESTABLISHED on (20986889.24/0/0)

The first two connections are ongoing, working, interactive ssh
connections. The other three connections died days ago on my new PC.

One thing that caught my eye was these very high timer values.
Checking the netstat source reveals that the value printed is "(double)
time_len / HZ" and that time_len is extracted from /proc/net/tcp. While
my CONFIG_HZ is 1000, I assume netstat has picked up HZ as 100 from
/usr/include/asm/param.h, in which case the raw time_len values are
around 2.1e9, and that really seems to imply that there is some integer
overflow since 2^31 = 2147483648.
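
As a rough check of that arithmetic, here is a minimal Python sketch. It
assumes netstat really divided by HZ = 100 (the guess above, not verified),
and the helper name raw_ticks is made up for illustration. It multiplies
the printed timer values of the three hung connections back into raw
time_len values and compares them with 2^31.

```python
from decimal import Decimal

# Assumption taken from the message: netstat was built with HZ == 100.
ASSUMED_NETSTAT_HZ = 100

# Timer values (in seconds) printed by netstat for the three hung connections.
printed_seconds = ["21020884.65", "21017016.33", "20986889.24"]


def raw_ticks(seconds_text, hz=ASSUMED_NETSTAT_HZ):
    """Undo netstat's "(double) time_len / HZ"; Decimal avoids float rounding."""
    return int(Decimal(seconds_text) * hz)


for text in printed_seconds:
    ticks = raw_ticks(text)
    # Show each raw value as a fraction of 2^31 (the signed 32-bit limit).
    print(f"{text} s -> time_len = {ticks} ({ticks / 2**31:.3f} of 2^31)")
```

Under that assumption all three raw values come out a little below 2^31,
at roughly 98% of it. That is consistent with an overflow or signedness
problem somewhere, but it does not prove one.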

Looking into get_tcp4_sock in net/ipv4/tcp_ipv4.c I see that timer_expires
is initialized with icsk->icsk_timeout for the troublesome cases. But
this is where my ability to trace it further ends, so I have no idea
how icsk->icsk_timeout gets such high values.

My old PC is currently still running with these stalled connections
present, so let me know if there is something I should try to investigate
further. I can post output from /proc/net/tcp and my .config if you want
to have a look. My old PC is 32 bit/Celeron single core, kernel 2.6.24,
while my new one is 64 bit/Q9300 quad core, kernel 2.6.25.3. The ethernet
cards are the following:

02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)

BR Håkon Løvdal