netdev - Re: After many hours all outbound connections get stuck in SYN_SENT


Date:	Thu, 20 Dec 2007 11:37:39 -0500
From:	"James Nichols" <jamesnichols3@...il.com>
To:	"Glen Turner" <gdt@....id.au>
Cc:	"Jan Engelhardt" <jengelh@...putergmbh.de>,
	"Eric Dumazet" <dada1@...mosbay.com>, linux-kernel@...r.kernel.org,
	"Linux Netdev List" <netdev@...r.kernel.org>
Subject: Re: After many hours all outbound connections get stuck in SYN_SENT

> But I'd be very surprised if the router is acting as anything more
> than a network-layer device. It might perhaps have some soft connection
> state being used for generating accounting records.  Being Cisco
> it's probably a switch-router, so it might carry some per-port hard
> state for validating source IP addresses and ARPs on each port.
>
> The firewall is much more likely to be carrying per-flow Sack
> state. The Cisco PIX had a bug with SACK handling (CSCse14419,
> fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it
> has regressed). A simple trace either side of the firewall will
> show the inconsistency between the TCP sequence number (which
> gets randomised) and the Sack sequence number (which didn't).
> You could disable the TCP Sequence Number Randomisation feature
> and see if the fault reoccurs.

I do have TCP Sequence Number Randomization enabled on my router.
However, if this were causing the issue, wouldn't it cause connection
problems all the time, not just after 38 hours of correct operation?  I
can look into turning it off, but I'll likely have to jump through
several hoops, which will be challenging without a clear, definitive
reason to believe it is the cause.  Plus, I've had this problem with at
least 2 other sets of network switches over the past 4 years.  I'm
actually running 7.0(6), which doesn't have the fix you mentioned.  If
this bug really could stay dormant and only cause problems after hours
of successful operation, then I could probably motivate the upgrade.  I
can try to set up a trace, but this is a lot of work for other people in
my organization, so it will take quite some time.
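
Once a trace exists, the comparison suggested above could be done per
segment roughly as follows.  This is a minimal sketch, not anything from
the PIX or the kernel: it assumes the ACK number and the SACK block
edges have already been read off the capture, the function names are
made up for illustration, and the 2^30 bound is just a generous guess at
the largest plausible window.  D-SACK blocks, which may legitimately sit
below the ACK number, would be flagged as well.

```python
MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def seq_diff(a, b):
    """Signed distance a - b in 32-bit TCP sequence space."""
    return ((a - b + (MOD >> 1)) % MOD) - (MOD >> 1)


def sack_block_plausible(ack, left, right, max_window=1 << 30):
    """True if SACK block [left, right) lies above ack and within max_window.

    max_window is an assumed upper bound, not a value taken from the trace.
    """
    return (
        seq_diff(left, ack) > 0          # block starts above the cumulative ACK
        and seq_diff(right, left) > 0    # block is not empty or reversed
        and seq_diff(right, ack) <= max_window
    )
```

If the sequence numbers are being rewritten but the SACK edges are not,
blocks seen on the inside of the firewall should mostly fail this check,
while the same segments captured on the outside should pass.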


> You probably should also investigate the Linux kernel,
> especially the size and locks of the components of the Sack data
> structures and what happens to those data structures after Sack is
> disabled (presumably the Sack data structure is in some unhappy
> circumstance, and disabling Sack allows the data to be discarded,
> magically unclogging the box).
>
> In the absence of the reporter wanting to dump the kernel's
> core, how about a patch to print the Sack data structure when
> the command to disable Sack is received by the kernel?
> Maybe just print the last 16b of the IP address?

Given the fact that I've had this problem for so long, over a variety
of networking hardware vendors and colo-facilities, this really sounds
good to me.  It will be challenging for me to justify a kernel core
dump, but a simple patch to dump the Sack data would be do-able.
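
Independently of any kernel patch, the stuck state can at least be
recorded from user space.  A minimal sketch, assuming the usual
/proc/net/tcp layout (remote address in the third column, state in the
fourth, with 02 meaning SYN_SENT) and a little-endian host such as x86;
the function name is hypothetical:

```python
import socket
import struct

TCP_SYN_SENT = 0x02  # state code in the "st" column of /proc/net/tcp


def syn_sent_peers(proc_net_tcp_text):
    """Return (ip, port) of the remote end of every socket in SYN_SENT."""
    peers = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        if int(fields[3], 16) != TCP_SYN_SENT:
            continue
        ip_hex, port_hex = fields[2].split(":")
        # Assumption: little-endian host, where the IPv4 address appears
        # as a byte-swapped 32-bit hex word (0100007F is 127.0.0.1).
        ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
        peers.append((ip, int(port_hex, 16)))
    return peers
```

Feeding it the contents of /proc/net/tcp once a minute and logging the
length of the result would show when the pile-up begins and which
destinations are affected.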