[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <D70D2636-7C49-48A5-B4DC-B9583A448415@earthlink.net>
Date: Tue, 15 Jun 2010 20:30:59 -0700
From: Mitchell Erblich <erblichs@...thlink.net>
To: Mitchell Erblich <erblichs@...thlink.net>
Cc: Eric Dumazet <eric.dumazet@...il.com>, netdev@...r.kernel.org
Subject: Re: Proposed linux kernel changes : scaling tcp/ip stack : 2nd part
On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
>
> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
>
>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
>>> To whom it may concern,
>>>
>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>> developers to see if there is any consensus that the below is a logical
>>> approach. Please also pass this email if there is a "owner(s)" of this stack
>>> to identify if a case exists for the below possible changes.
>>>
>>> I am not currently on the linux kernel mail group.
>>>
>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>> merged the changes into the company's local tree and left the possible
>>> global integration to others.
>>>
>>> I have been approached by a number of companies about scaling the
>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>> time on my hands and am considering looking into this area on my own.
>>>
>>> The first assumption is that if extra cores are available, that a single
>>> received homogeneous flow of a large number of packets/segments per
>>> second (pps) can be split into non-equal flows. This split can in effect
>>> allow a larger recv'd pps rate at the same core load while splitting off
>>> other workloads, such as xmit'ing pure ACKs.
>>>
>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>>> between cores), and creating logical separations where in a many core
>>> system, different cores could have new kernel threads that operate in
>>> parallel within the tcp/ip stack. The initial separation points would be at
>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some
>>> form of output.
>>>
>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
>>> with some form of queuing & scheduling, would be needed. In addition,
>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
>>> to separate the I/O.
>>>
>>> A possible validation test is to identify the max recv'd pps rate within the
>>> tcp/ip modules within normal flow TCP established state with normal order
>>> of say 64byte non fragmented segments, before and after each
>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>>
>>> I am willing to have a private git Linux.org tree that concentrates proposed
>>> changes into this tree and if there is willingness, a seen want/need then identify
>>> how to implement the merge.
>>
>> Hi Mitchell
>>
>> We work everyday to improve network stack, and standard linux tree is
>> pretty scalable, you dont need to setup a separate git tree for that.
>>
>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>> net-next-2.6 where we put all our changes.
>>
>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>>
>> I suggest you read the last patches (say .. about 10.000 of them), to
>> have an idea of things we did during last years.
>>
>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>> placement...
>>
>> Its nice to see another man joining the team !
>>
>> Thanks
>>
>
>
> Lets start with a two part Linux kernel change and a tcp input/output change:
>
> 2 Parts: 2nd part TBD
>
> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> in the generic kernel. TBD.
>
> This change should have no effect with normal available kernel mem allocs.
>
> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> our last pages for input skbufs and not for xmit allocs.
>
> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
> decrease the number of in-flight ACKs and the peer should do burst avoidance
> if our later ack increases the window in a larger chunk..
>
> And use the last pages to decrease the chance of dropping a input pkt or
> running out of recv descriptors, because of mem back pressure.
>
> The change could check for some form of mem pressure before the alloc,
> but the alloc in itself should suffice. We could also do a ECN type check before
> the alloc.
>
> Now the kicker. I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
> and change just the 1 arg.
>
> code : tcp_output.c : tcp_send_ack()
> line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP); /* with a NO SLEEP */
>
> Suggestions, feedback??
>
> Mitchell Erblich
>
>
>
>
Sorry :),
2nd part:
use GFP_NOWAIT as 2nd arg to alloc_skb()
Mitchell Erblich
>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists