Message-ID: <Pine.CYG.4.58.0701040913300.3128@shill1-mobl.eicon.com>
Date: Wed, 10 Jan 2007 11:55:58 +0000
From: Steve Hill <steve.hill@...logic.com>
To: Sridhar Samudrala <sri@...ibm.com>
cc: Andrew Morton <akpm@...l.org>, netdev@...r.kernel.org,
lksctp-developers@...ts.sourceforge.net
Subject: Re: Fw: Intermittent SCTP multihoming breakage
On Wed, 3 Jan 2007, Sridhar Samudrala wrote:
Sorry for the delay in replying.
> No. lksctp-developers mailing list is still the best place for SCTP related
> discussions. You can subscribe and look in the archives at
> http://lists.sourceforge.net/lists/listinfo/lksctp-developers
Hmm, I had a look there and it seemed fairly inactive and overrun by
spam. (I've also been unable to subscribe.)
> How are the 2 machines connected? Are they connected directly or
> via a router?
They are currently connected together directly through crossover cables.
> Do you see both the addresses when you do cat /proc/net/sctp/assocs
> after the association is established on both the peers?
Yes, the contents of /proc/net/sctp/assocs look correct.
> How are you dropping traffic? You could try simulating failover by
> bringing down the interface or physically removing the link.
I have been using iptables to drop SCTP packets on both the INPUT and
OUTPUT chains. However, I get the same results if I just unplug the
network cable (using iptables is easier for my testing since I don't have
to crawl around behind the test systems :)
> > 1. Sometimes, just after failing over to the second path I see an ABORT.
> This seems to indicate that somehow the app has terminated.
The abort _appears_ to be caused by a retransmit timer expiring, causing
the SCTP stack to tear down the association. However, I haven't done much
investigation of this problem yet - I've been focussing on the second
problem since it seems to happen more frequently.
> > 2. More frequently, the association stays up indefinitely, with heartbeat
> > requests and acks on the second path, but no data chunks are sent even
> > though the transmit queue on the transmitting end appears to be full and
> > the socket is blocking writes.
> This is strange. Can you collect tcpdump traces on sender and receiver when
> this happens?
I've taken dumps of the data on the wire for both paths:
http://www.nexusuk.org/~steve/sctp/path1.pcap
http://www.nexusuk.org/~steve/sctp/path2.pcap
I can't see anything odd in the network traffic - it just stops as if it
has no more data to send. However, the socket appears to still be
blocking so the application cannot give it any new data.
This seems to be a problem with the abandonment functionality:
1. Transmit chunk 1. The transmitted list now contains chunk 1.
2. Chunk 1 and its retransmissions get lost on the network.
3. Abandon chunk 1. The transmitted list is now empty.
4. Transmit chunk 2. The transmitted list now contains chunk 2.
5. Receive a gap-ack for chunk 2, indicating that chunk 1 is missing.
At this point, the T3 timer is disabled at the bottom of
sctp_check_transmitted() since all the chunks in the transmitted queue are
gap-acked. The whole connection now stalls, waiting for the SACK for
chunk 1 that will never arrive.
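
The five steps above can be replayed in a toy model. This is a Python
sketch, not kernel code: the ToySender class and all of its fields and
methods are hypothetical names, and the timer rule in receive_sack() is
only an assumed reading of the check at the bottom of
sctp_check_transmitted() as described above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    tsn: int
    gap_acked: bool = False


@dataclass
class ToySender:
    """Toy model of the sender state described above (not kernel code)."""
    transmitted: list = field(default_factory=list)  # chunks awaiting a SACK
    abandoned: list = field(default_factory=list)    # chunks given up on
    cum_tsn_acked: int = 0                           # cumulative TSN ack point
    t3_running: bool = False                         # retransmit timer state

    def transmit(self, tsn):
        # Steps 1 and 4: sending a chunk queues it and arms T3.
        self.transmitted.append(Chunk(tsn))
        self.t3_running = True

    def abandon(self, tsn):
        # Step 3: the chunk leaves the transmitted list without being acked.
        for chunk in list(self.transmitted):
            if chunk.tsn == tsn:
                self.transmitted.remove(chunk)
                self.abandoned.append(chunk)

    def receive_sack(self, cum_tsn, gap_acked_tsns):
        # Step 5: drop cumulatively acked chunks, mark gap-acked ones.
        self.cum_tsn_acked = max(self.cum_tsn_acked, cum_tsn)
        self.transmitted = [c for c in self.transmitted
                            if c.tsn > self.cum_tsn_acked]
        for chunk in self.transmitted:
            chunk.gap_acked = chunk.tsn in gap_acked_tsns
        # Assumed rule: T3 is stopped when every chunk still on the
        # transmitted list has been gap-acked.
        if all(c.gap_acked for c in self.transmitted):
            self.t3_running = False

    def stalled(self):
        # Data is still outstanding but nothing will trigger a retransmit.
        return bool(self.transmitted) and not self.t3_running


def run_scenario():
    sender = ToySender()
    sender.transmit(1)                                  # step 1
    sender.abandon(1)                                   # steps 2 and 3
    sender.transmit(2)                                  # step 4
    sender.receive_sack(cum_tsn=0, gap_acked_tsns={2})  # step 5
    return sender


if __name__ == "__main__":
    s = run_scenario()
    print("stalled:", s.stalled(), "cum_tsn_acked:", s.cum_tsn_acked)
```

In this model the cumulative ack point never moves past chunk 1, chunk 2
stays on the transmitted list as gap-acked, and T3 is off, which matches
the stall seen on the wire.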
Note that this is not unordered data, and I'm not clear on how abandoned
chunks are supposed to be handled. I hadn't intentionally enabled the
abandonment functionality; the timetolive was set on the transmitted
chunks by accident.
--
- Steve Hill
Software Engineer
Dialogic
Fordingbridge, Hampshire, UK
+44-1425-651392
steve.hill@...logic.com