netdev - Re: [PATCH net-next 2/3] tipc: byte-based overload control on socket receive queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5125F5D3.1000509@ericsson.com>
Date:	Thu, 21 Feb 2013 11:24:19 +0100
From:	Jon Maloy <jon.maloy@...csson.com>
To:	Neil Horman <nhorman@...driver.com>
CC:	Paul Gortmaker <paul.gortmaker@...driver.com>,
	David Miller <davem@...emloft.net>, <netdev@...r.kernel.org>,
	Ying Xue <ying.xue@...driver.com>,
	Jon Maloy <maloy@...jonn.com>
Subject: Re: [PATCH net-next 2/3] tipc: byte-based overload control on socket
 receive queue

On 02/19/2013 10:44 PM, Neil Horman wrote:
> On Tue, Feb 19, 2013 at 09:16:40PM +0100, Jon Maloy wrote:
>> On 02/19/2013 08:18 PM, Neil Horman wrote:
>>> On Tue, Feb 19, 2013 at 06:54:14PM +0100, Jon Maloy wrote:
>>>> On 02/19/2013 03:26 PM, Neil Horman wrote:
>>>>> On Tue, Feb 19, 2013 at 09:07:54AM +0100, Jon Maloy wrote:
>>>>>> On 02/18/2013 09:47 AM, Neil Horman wrote:
>>>>>>> On Fri, Feb 15, 2013 at 05:57:46PM -0500, Paul Gortmaker wrote:
>>>>>>>> From: Ying Xue <ying.xue@...driver.com>
>> <snip>
> 
>>>> 	
>>>> There are two reasons for this. 
>>>> The first one due to the message oriented nature of the flow control for 
>>>> connections. Max message size is 65k, and max number of unacked messages 
>>>> (at socket level, that is) before the sending process will take a break 
>>>> is 1024. 
>>>> So, simple maths gives that we must allow for 64MB + sk_overhead to guarantee 
>>>> that a connection never is broken because of receiver overload. Contrary to TCP,
>>> Note, this is false, due to the fact that, it presumes that the sender will
>>> honor the congestion window.  Granted, that would be a sender violation, but its
>>> still possible for packets to get lost due to receiver overrun.  
>>
>> The reason for this high limit is exactly to guard against crazy or malevolent 
>> senders. If they respect their send window they will never hit this limit on
>> connections.
>>
> Nope, You don't get to have it both ways - If a sender is malevolent or crazy,
> what makes you think it will respect its send window?

Nothing. That is why we need this extra security measure that this limit provides.
The normal senders will never hit it, the crazy ones will, and have their 
connections broken as a consequence. 
Sorry I didn't express this clearly enough.

> 
>> The fact that
>>> you ack packets before accepting them to the receive queue is the problem that
>>> needs fixing in this case, but that looks like it can be easily accomplished
>>> (see below).
>>>
[...]
> No, I get that, but the fact that TIPC has a shared retransmit queue between all
> sockets on the link isn't an excuse to violate the limits set on an individual
> socket.
> 
>> Granted, the retransmission will take care of the dropped packet, but in the 
>> meantime no other packets can be delivered through from that link, to any 
>> socket. Basically, all TIPC communication between the two involved
>> nodes in the given direction would grind to a halt until the slow or overwhelmed 
>> receiver process has decided to work off his receive queue, something that may 
>> never happen if it is faulty.
> Sounds like a bug.  That should be fixed.

This is not a bug, but an inherent  property of any protocol providing
sequential, guaranteed delivery of packets, TCP inclusive.
If you lose/drop a packet in the sequence, no subsequent packets in
the stream can be delivered until the missing one has been retransmitted and
delivered. To make this delivery dependent of the whims of each and any of the 
potentially hundreds of receiving processes is simply not a good idea.

I could even show that your proposal would cause almost immediate
deadlock, due to the order incoming and outgoing data paths are grabbing 
the locks. But I would rather drop this part of the discussion; we can 
achieve what we want anyway, with much simpler means. See below.

> 
>> You may see this as a flaw, but it is a consequence of that TIPC is a radically 
>> different design than IP based protocols, designed for completely different needs.
>> The shared link concept is a fundamental feature that has a lot of other advantages 
>> which I won't elaborate on here.
>>
> Very well.  While I'm thinking of it though, you also seem to be making a large
> leap in reasoning - you seem to be repeatedly equating my request to have you
> honor the limits of sk_rcvbuf, with a desire to have that limit be smaller than
> what you currently have it set to.  

Not at all. Just like you, I am trying to find a way to tune up the defaults 
and the the limits using existing mechanisms, so that we can honour the 
nominal value of sk_rcvbuf. 
I think what you suggested may be a way to achieve this, but I am a little 
worried about side effects. I'll explain further down.

> Thats absolutely untrue.  I don't care if
> you set your limit to UINT_MAX/2 on a socket, I just want what you set to be

[...]

>>
>> Ok, I wasn't aware of that. Now, if we could set these parameters from inside 
>> the module, when a socket is created, we I think we have what we need. We 
>> don't want to force every socket creator to set these limits explicitly, unless
>> he has some very particular needs and knows what he is doing.
>>
> I've tried to explain this several times now.  You don't have to have set this value to
> what you want programatically, nor do you need to force the value from within
> the module code itself, you can do it administratively.  

This is exactly what I want to avoid. To force the users to set special parameters
during boot up just to make TIPC work would become another obstacle for the its 
adoption. TIPC is extremely easy to install and use, and it should remain so.

A programmatic solution is much better, if we can solve it from inside the module,
and this is what I am convinced that we can.
The following is an approach we discussed (inside the team) earlier, but dropped
due to the perceived impossibility to set the sk_rcvbuf to the desired values.
Now this limitation is not there, as I understand, and we can revive our 
proposal.

1: When a TIPC socket is created, sk_rcvbuf is set to 64MB+ (as it was before the 
   latest patch). This is the limit used as last resort against connected peers
   not respecting the connection send window.

2: Datagram messages are checked against fractions of this value, according to
   their importance priority. E.g. LOW->sk_rcvbuf/16,  MEDIUM->sk_rcvbuf/8
   HIGH->sk_rcvbuf/4 and CRITICAL->sk_rcvbuf/2.

3: In the unlikely event that anybody wants to change these limits, he can change
   sk_rcvbuf via setsockopt(SOL_SOCK) either by first changing
   /proc/sys/net/core/rmem_max as you suggest, or via a dedicated setsockopt(SOL_TIPC).
   The latter option would have the advantage that it could enforce a lower limit of
   skr_rcvbuf of 64MB, something that would be unthinkable with SOL_SOCK, because it 
   would obviously affect other protocols.

What do you think?

///jon

If you set:
> 
> /proc/sys/net/core/rmem_default

But this will 
> 
> to the value that you want all your sockets to have, any socket that gets
> initalized with sock_init_data will inherit that value.  Note that, when doing
> so, you may also have to set:
> 
> 
> 
> As you can't adminstratively set your default socket rcvbuf value to something
> larger than the maximum allowed value without raising the maximum allowed value
> first.
> 
> Then all you have to do is make sure those values are set during boot up, and for
> users, it will appear as though nothing has changed.
> 
> Neil
> 
>>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html