lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080724060448.GA10203@elte.hu>
Date:	Thu, 24 Jul 2008 08:04:48 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Willy Tarreau <w@....eu>, "David S. Miller" <davem@...emloft.net>
Cc:	David Newall <davidn@...idnewall.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	akpm@...ux-foundation.org, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Stefan Richter <stefanr@...6.in-berlin.de>,
	"Rafael J. Wysocki" <rjw@...k.pl>
Subject: [TCP bug, regression] stuck distcc connections in latest -git


* Ingo Molnar <mingo@...e.hu> wrote:

> Alas, the problem has not reoccured since then - more than a thousand 
> kernel builds down the line. [...]

the permanently hug distcc kernel build bug triggered again, twice. 
First time it happened yesterday, i left it running overnight and it 
never recovered after a 14+ hours of wait.

It shows a similar pattern, 'ESTABLISHED' state on both sides, but the 
client-side is stuck and the server (running latest kernel) is seemingly 
clueless about that fact:

 client:

  Proto Recv-Q Send-Q Local Address       Foreign Address     State
  tcp        0 375450 10.0.1.16:39201     10.0.1.19:3632      ESTABLISHED

 server:

  Proto Recv-Q Send-Q Local Address       Foreign Address     State
  tcp        0      0 10.0.1.19:3632      10.0.1.16:39201     ESTABLISHED

i waited ~30 minutes in this second case.

the client (running 2.6.24) does periodic 120 seconds retransmits:

07:40:48.255452 IP dione.39201 > phoenix.distcc: . 1608:2144(536) ack 1 win 584007:40:48.255547 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:40:48.255564 IP dione.39201 > phoenix.distcc: . 67143:67679(536) ack 1 win 5840
07:40:48.255648 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:42:48.255440 IP dione.39201 > phoenix.distcc: . 2144:2680(536) ack 1 win 5840
07:42:48.255559 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:42:48.255570 IP dione.39201 > phoenix.distcc: . 67679:68215(536) ack 1 win 5840
07:42:48.255659 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:44:48.255436 IP dione.39201 > phoenix.distcc: . 2680:3216(536) ack 1 win 584007:44:48.255570 IP phoenix.distcc > dione.39201: . ack 3216 win 65535
07:44:48.255585 IP dione.39201 > phoenix.distcc: . 68215:68751(536) ack 1 win 5840
07:44:48.255669 IP phoenix.distcc > dione.39201: . ack 3216 win 65535

the server (running the latest kernel) responds:

07:40:47.551098 IP dione.39201 > phoenix.distcc: . 1072:1608(536) ack 1 win 584007:40:47.551141 IP phoenix.distcc > dione.39201: . ack 1608 win 65535
07:40:47.551204 IP dione.39201 > phoenix.distcc: . 66607:67143(536) ack 1 win 5840
07:40:47.551213 IP phoenix.distcc > dione.39201: . ack 1608 win 65535
07:42:47.570994 IP dione.39201 > phoenix.distcc: . 1608:2144(536) ack 1 win 584007:42:47.571027 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:42:47.571117 IP dione.39201 > phoenix.distcc: . 67143:67679(536) ack 1 win 5840
07:42:47.571127 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:44:47.590901 IP dione.39201 > phoenix.distcc: . 2144:2680(536) ack 1 win 584007:44:47.590960 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:44:47.591042 IP dione.39201 > phoenix.distcc: . 67679:68215(536) ack 1 win 5840
07:44:47.591054 IP phoenix.distcc > dione.39201: . ack 2680 win 65535

full client socket state:

 dione:~> grep $(printf "%X\n" 39201) /proc/net/tcp
   44: 1001000A:9921 1301000A:0E30 01 0005ABF2:00000000 01:00002B8A 
       00000000   500        0 63130083 2 ffff81000c762d00 120000 0 0 28 101

 [ a few minutes later ]

   44: 1001000A:9921 1301000A:0E30 01 0005A392:00000000 01:00002BF0 
       00000000   500        0 63130083 2 ffff81000c762d00 120000 0 0 32 101

 [ i.e. the tx queue did increase by 2144 bytes - 4x 536 bytes ]

full server socket state:

 phoenix:~> grep $(printf "%X\n" 39201) /proc/net/tcp
    6: 1301000A:0E30 1001000A:9921 01 00000000:00000000 00:00000000 
       00000000    99        0 728382 1 ffff88042d8db280 300 4 30 2 -1

 [ a few minutes later ]

    6: 1301000A:0E30 1001000A:9921 01 00000000:00000000 00:00000000 
       00000000    99        0 728382 1 ffff88042d8db280 300 4 30 2 -1

 [ i.e. no change - no pending packets ]

I've started a longer capture session as well - it seems the TCP stack 
is slowly cycling through retransmissions of 536-byte packets, with 
375450 bytes pending? At 120 seconds a pop that would be about 23 hours 
to make any progress on - but i'm not sure i interpreted that right. It 
all looks very weird.

The timestamps of the two boxes are synced up to within about 1 second:

  earth4:~> for N in dione phoenix; do ssh $N date; done
  Thu Jul 24 07:44:02 CEST 2008
  Thu Jul 24 07:44:02 CEST 2008

( but the two boxes are responding to each other fine, so ordering of 
  events is not a question here. )

Any other state you'd like to see before i continue with -tip testing? 

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ