Message-Id: <1167872389.8646.25.camel@w-sridhar2.beaverton.ibm.com>
Date: Wed, 03 Jan 2007 16:59:49 -0800
From: Sridhar Samudrala <sri@...ibm.com>
To: Andrew Morton <akpm@...l.org>
Cc: netdev@...r.kernel.org, Steve Hill <steve.hill@...logic.com>,
lksctp-developers@...ts.sourceforge.net
Subject: Re: Fw: Intermittent SCTP multihoming breakage
On Wed, 2007-01-03 at 15:46 -0800, Andrew Morton wrote:
>
> Begin forwarded message:
>
> Date: Wed, 3 Jan 2007 11:54:26 +0000
> From: Steve Hill <steve.hill@...logic.com>
> To: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
> Subject: Intermittent SCTP multihoming breakage
>
>
>
> Apologies if I'm posting to the wrong list - the lksctp lists seem to be a
> bit dead these days and a bit of Googling seemed to indicate that SCTP
> development discussions might have moved here.
No, the lksctp-developers mailing list is still the best place for SCTP-related
discussions. You can subscribe and browse the archives at
http://lists.sourceforge.net/lists/listinfo/lksctp-developers
>
> I'm running under the 2.6.16.1 kernel and have an intermittent problem
> with the SCTP stack. Having reviewed the git logs I can't see any
> indication that the problem has been fixed in more recent kernels, but it
> is very difficult to test since it is so intermittent.
If possible, I would suggest moving to the latest mainline kernel, 2.6.19.
But 2.6.16.1 should work OK for simple multihoming cases.
>
> I am running a multihomed connection between 2 machines, (2 NICs on
> each machine, so 2 paths for the connection) and tcpdump shows heartbeat
> requests and acks on both paths. Putting data over the link correctly
> sends it over the first path.
How are the 2 machines connected? Are they connected directly or
via a router?
Do you see both addresses on each peer when you run
cat /proc/net/sctp/assocs after the association is established?
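
If it helps, the check above can be scripted. The following is only a rough
sketch in Python, not something taken from the lksctp tools. It assumes that
each association line lists the local addresses, then "<->", then the peer
addresses, and that an address may be prefixed with "*" to mark the primary.
The exact column layout differs between kernel versions, so the sketch picks
out only tokens that parse as IP addresses. The function names and the default
path argument are my own.

```python
import ipaddress


def _is_ip(token):
    """Return True if token (ignoring a leading '*') parses as an IP address."""
    try:
        ipaddress.ip_address(token.lstrip("*"))
        return True
    except ValueError:
        return False


def assoc_addresses(line):
    """Split one association line on '<->' into (local, remote) address lists.

    Assumption: local addresses appear before '<->' and peer addresses after
    it. Non-address columns (ports, counters, pointers) are filtered out.
    """
    left, sep, right = line.partition("<->")
    if not sep:
        return [], []
    local = [t.lstrip("*") for t in left.split() if _is_ip(t)]
    remote = [t.lstrip("*") for t in right.split() if _is_ip(t)]
    return local, remote


def is_multihomed(line):
    """True if the line shows at least two addresses on each side."""
    local, remote = assoc_addresses(line)
    return len(local) >= 2 and len(remote) >= 2


def check_assocs(path="/proc/net/sctp/assocs"):
    """Return (local, remote) address lists for every association in the file."""
    with open(path) as f:
        pairs = [assoc_addresses(line) for line in f]
    # Drop the header line and anything else without addresses.
    return [(lo, re) for lo, re in pairs if lo or re]
```

Running check_assocs() on both machines after the association is up should
return two local and two remote addresses per association if multihoming was
negotiated as expected.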
>
> If I drop the traffic on one of the NICs then most of the time it
> correctly fails over to the second path and I see the data being sent
> and acknowledged correctly on the second path. However, I also
> intermittently see two failure conditions:
How are you dropping traffic? You could try simulating failover by
bringing down the interface or physically removing the link.
>
> 1. Sometimes, just after failing over to the second path I see an ABORT.
This suggests that the application on one of the peers has terminated.
> 2. More frequently, the association stays up indefinitely, with heartbeat
> requests and acks on the second path, but no data chunks are sent even
> though the transmit queue on the transmitting end appears to be full and
> the socket is blocking writes.
This is strange. Can you collect tcpdump traces on sender and receiver when
this happens?
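
To capture both paths at once, something like the following Python sketch
could generate one tcpdump command per NIC. It is only an illustration: the
interface names and the output file prefix are placeholders, and it relies
only on the standard tcpdump options -i, -s and -w plus the "sctp" filter
keyword.

```python
def tcpdump_commands(interfaces, prefix="sctp-trace"):
    """Build one tcpdump argument list per interface, capturing only SCTP.

    -s 0 asks for full packets, -w writes a pcap file for later analysis.
    The interface names and file prefix are supplied by the caller.
    """
    return [
        ["tcpdump", "-i", ifname, "-s", "0",
         "-w", f"{prefix}-{ifname}.pcap", "sctp"]
        for ifname in interfaces
    ]


# Example with hypothetical interface names; each list can be passed to
# subprocess.Popen (as root) on both the sender and the receiver.
commands = tcpdump_commands(["eth0", "eth1"])
```

Start the captures before dropping traffic on the first path, so that the
failover and any stalled data transfer both end up in the trace.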
Thanks
Sridhar
>
> I have been adding debugging to the kernel in an attempt to track down the
> source of the second failure condition, and I am wondering if anyone else
> has seen similar behaviour?
>
> --
> - Steve Hill
> Software Engineer
> Dialogic
> Fordingbridge, Hampshire, UK
> +44-1425-651392
> steve.hill@...logic.com