Message-ID: <45A55151.2060401@hp.com>
Date: Wed, 10 Jan 2007 15:49:21 -0500
From: Vlad Yasevich <vladislav.yasevich@...com>
To: Steve Hill <steve.hill@...logic.com>
Cc: Sridhar Samudrala <sri@...ibm.com>, Andrew Morton <akpm@...l.org>,
netdev@...r.kernel.org, lksctp-developers@...ts.sourceforge.net
Subject: Re: Fw: Intermittent SCTP multihoming breakage
Steve Hill wrote:
> On Wed, 3 Jan 2007, Sridhar Samudrala wrote:
>
> Sorry for the delay in replying.
>
>> No. lksctp-developers mailing list is still the best place for SCTP related
>> discussions. You can subscribe and look in the archives at
>> http://lists.sourceforge.net/lists/listinfo/lksctp-developers
>
> Hmm, I had a look there and it seemed reasonably inactive and overrun by
> spam. (And I've been unable to subscribe.)
>
>> How are the 2 machines connected? Are they connected directly or
>> via a router?
>
> They are currently connected together directly through crossover cables.
>
>> Do you see both the addresses when you do cat /proc/net/sctp/assocs
>> after the association is established on both the peers?
>
> Yes, the contents of /proc/net/sctp/assocs looks correct.
>
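For reference (and for anyone on the list trying to reproduce this), this is
roughly how a second local address ends up on the endpoint so that both paths
show in /proc/net/sctp/assocs -- a minimal sketch using sctp_bindx() from
lksctp-tools.  The addresses and port below are made up, not your actual setup:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

int main(void)
{
	int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
	struct sockaddr_in addrs[2];

	memset(addrs, 0, sizeof(addrs));
	addrs[0].sin_family = AF_INET;
	addrs[0].sin_port   = htons(5000);	/* hypothetical port */
	inet_pton(AF_INET, "192.168.1.1", &addrs[0].sin_addr);	/* path 1 (made up) */
	addrs[1] = addrs[0];
	inet_pton(AF_INET, "192.168.2.1", &addrs[1].sin_addr);	/* path 2 (made up) */

	/* Both addresses become part of the local endpoint and are
	 * advertised to the peer when the association is set up. */
	if (sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR) < 0)
		perror("sctp_bindx");
	return 0;
}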
>> How are you dropping traffic? You could try simulating failover by
>> bringing down the interface or physically removing the link.
>
> I have been using iptables to drop SCTP packets on both the INPUT and
> OUTPUT chains. However, I get the same results if I just unplug the
> network cable (using iptables is easier for my testing since I don't have
> to crawl around behind the test systems :)
>
>>> 1. Sometimes, just after failing over to the second path I see an ABORT.
>> This seems to indicate that somehow the app has terminated.
>
> The abort _appears_ to be caused by a retransmit timer expiring, causing
> the SCTP stack to tear down the association. However, I haven't done much
> investigation of this problem yet - I've been focussing on the second
> problem since it seems to happen more frequently.
>
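If what is expiring there is pushing the association error counter past its
maximum before the second path takes over, one knob worth checking is the
association max-retransmit threshold (net.sctp.association_max_retrans, or per
socket via SCTP_ASSOCINFO).  A rough sketch, with an arbitrary value, just to
show where it lives:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

static int raise_assoc_max_rxt(int sd)
{
	struct sctp_assocparams ap;

	memset(&ap, 0, sizeof(ap));
	ap.sasoc_assoc_id   = 0;	/* ignored on a one-to-one style socket */
	ap.sasoc_asocmaxrxt = 20;	/* arbitrary value, for illustration only */

	return setsockopt(sd, IPPROTO_SCTP, SCTP_ASSOCINFO, &ap, sizeof(ap));
}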
>>> 2. More frequently, the association stays up indefinitely, with heartbeat
>>> requests and acks on the second path, but no data chunks are sent even
>>> though the transmit queue on the transmitting end appears to be full and
>>> the socket is blocking writes.
>> This is strange. Can you collect tcpdump traces on sender and receiver when
>> this happens?
>
> I've taken dumps of the data on the wire for both paths:
> http://www.nexusuk.org/~steve/sctp/path1.pcap
> http://www.nexusuk.org/~steve/sctp/path2.pcap
Taking a look at these, it does appear to completely stall...  There are some
rather interesting retransmissions that don't look quite right...
>
> I can't see anything odd in the network traffic - it just stops as if it
> has no more data to send. However, the socket appears to still be
> blocking so the application cannot give it any new data.
>
> This seems to be a problem with the abandonment functionality:
> 1. Transmit chunk 1. The transmitted list now contains chunk 1.
> 2. Chunk 1 and its retransmissions get lost on the network.
> 3. Abandon chunk 1. The transmitted list is now empty.
This causes a FORWARD TSN chunk to be sent to the peer, telling it
to advance its cumulative TSN (CTSN) to that of chunk 1.
> 4. Transmit chunk 2. The transmitted list now contains chunk 2.
> 5. Receive a gap-ack for chunk 2, indicating that chunk 1 is missing.
Yes, but at this point, we will regenerate the FORWARD TSN since chunk 1
is still on the abandoned list.
> At this point, the T3 timer is disabled at the bottom of
> sctp_check_transmitted() since all the chunks in the transmitted queue are
> gap-acked. The whole connection now stalls, waiting for the SACK for
> chunk 1 that will never arrive.
>
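To make sure I follow the sequence, here is a boiled-down model of the
decision you are describing (this is not the actual kernel code, and the
names are invented):

struct chunk {
	unsigned int tsn;
	int gap_acked;	/* covered by a SACK gap-ack block, not cumulatively acked */
};

/* Run per SACK, after walking the transmitted list. */
static void check_transmitted(struct chunk *transmitted, int n, int *t3_running)
{
	int outstanding = 0;
	int i;

	for (i = 0; i < n; i++)
		if (!transmitted[i].gap_acked)
			outstanding++;

	/* After step 5 above: chunk 1 was abandoned and is off this list,
	 * chunk 2 is gap-acked, so nothing is outstanding and the T3-rtx
	 * timer is stopped ... */
	if (outstanding == 0)
		*t3_running = 0;

	/* ... yet the peer's cumulative TSN still sits below the abandoned
	 * chunk.  With the timer off there is nothing left to fire and push
	 * the cumulative ack point forward, so the sender sits with a full
	 * transmit queue waiting for an ack that never comes. */
}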
I'll look some more at this...
-vlad
> It should be noted that this is not unordered data and I'm not clear on
> how abandoned chunks are supposed to be handled - I hadn't intentionally
> enabled the abandonment functionality; the timetolive was set on the
> transmitted chunks by accident.
>
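For reference, the lifetime in question is the per-message timetolive from the
SCTP sockets API.  A sketch of the lksctp-tools call where it would have been
set by accident (the helper name and values are made up for illustration):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

static int send_one(int sd, const char *buf, size_t len)
{
	/* 0 for timetolive means the chunk is never abandoned; any non-zero
	 * value (in ms) lets the stack give up on it after that long and,
	 * with FORWARD TSN support, skip past it as described above. */
	return sctp_sendmsg(sd, buf, len,
			    NULL, 0,	/* destination: use the association's primary */
			    0,		/* ppid */
			    0,		/* flags */
			    0,		/* stream */
			    0,		/* timetolive: 0 = fully reliable */
			    0);		/* context */
}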