Message-Id: <97746864-ED54-4A12-AFE7-752AA6E41CDD@earthlink.net>
Date: Tue, 15 Jun 2010 20:11:42 -0700
From: Mitchell Erblich <erblichs@...thlink.net>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev@...r.kernel.org
Subject: Re: Proposed linux kernel changes : scaling tcp/ip stack
On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> On Thursday, 3 June 2010 at 01:16 -0700, Mitchell Erblich wrote:
>> To whom it may concern,
>>
>> First, my assumption is to keep this discussion local to just a few tcp/ip
>> developers to see if there is any consensus that the below is a logical
>> approach. Please also pass this email if there is a "owner(s)" of this stack
>> to identify if a case exists for the below possible changes.
>>
>> I am not currently on the linux kernel mail group.
>>
>> I have experience with modifications of the Linux tcp/ip stack, and have
>> merged the changes into the company's local tree and left the possible
>> global integration to others.
>>
>> I have been approached by a number of companies about scaling the
>> stack with the assumption of a number of cpu cores. At present, I find extra
>> time on my hands and am considering looking into this area on my own.
>>
>> The first assumption is that if extra cores are available, that a single
>> received homogeneous flow of a large number of packets/segments per
>> second (pps) can be split into non-equal flows. This split can in effect
>> allow a larger recv'd pps rate at the same core load while splitting off
>> other workloads, such as xmit'ing pure ACKs.
>>
>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>> between cores), and creating logical separations where in a many core
>> system, different cores could have new kernel threads that operate in
>> parallel within the tcp/ip stack. The initial separation points would be at
>> the ip/tcp layer boundary and where any recv'd sk/pkt would generate some
>> form of output.
>>
>> The ip/tcp layer would be split like the vintage AT&T STREAMS protocol,
>> and some form of queuing & scheduling would be needed. In addition,
>> the queuing/scheduling of other kernel threads would occur within ip & tcp
>> to separate the I/O.
>>
>> A possible validation test is to identify the max recv'd pps rate within the
>> tcp/ip modules within normal flow TCP established state with normal order
>> of say 64byte non fragmented segments, before and after each
>> incremental change. Or the same rate with fewer core/cpu cycles.
>>
>> I am willing to have a private git Linux.org tree that concentrates proposed
>> changes into this tree and, if there is willingness and a seen want/need, then identify
>> how to implement the merge.
>
> Hi Mitchell
>
> We work every day to improve the network stack, and the standard Linux tree is
> pretty scalable; you don't need to set up a separate git tree for that.
>
> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
> net-next-2.6 where we put all our changes.
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>
> I suggest you read the last patches (say .. about 10.000 of them), to
> have an idea of things we did during last years.
>
> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
> placement...
>
> It's nice to see another man joining the team!
>
> Thanks
>
Let's start with a two-part Linux kernel change and a TCP input/output change:
2 parts: 2nd part TBD
Summary: Don't use the last free pages for TCP ACKs by passing GFP_ATOMIC to our
skb allocs. This is a 1-line change in tcp_output.c with a new gfp.h flag, plus a change
in the generic kernel (TBD).
This change should have no effect when kernel memory is normally available.
Under memory pressure (WAITING for clean memory), we should be spending
our last pages on input skbs and not on xmit allocs.
By delaying skb allocations when kmem is low, we secondarily slow down the
TCP flow: in slow start (SS) we are almost doing a DELACK; otherwise, in congestion
avoidance (CA), it should/could decrease the number of in-flight ACKs, and the peer
should do burst avoidance if our later ACK opens the window in a larger chunk.
We then use the last pages to decrease the chance of dropping an input pkt or
running out of recv descriptors because of memory back pressure.
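
To make the "fewer, larger ACKs" effect concrete, here is a minimal userspace sketch in C. It is a toy model, not kernel code: `struct toy_rcv`, `toy_rcv_segment()` and the `alloc_ok` parameter are hypothetical names invented for this illustration. It assumes in-order segments and cumulative ACKs only; when the ACK buffer cannot be allocated, the ACK is left pending and the next successful ACK covers everything received so far.

```c
/* Userspace toy model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_rcv {
    uint32_t rcv_nxt;     /* next sequence number expected from the peer */
    uint32_t last_acked;  /* highest cumulative ACK number sent so far   */
    unsigned acks_sent;   /* number of pure ACKs actually transmitted    */
    bool     ack_pending; /* an ACK is owed but could not be built yet   */
};

/*
 * Accept one in-order segment of `len` bytes and try to ACK it at once.
 * `alloc_ok` stands in for the result of the ACK buffer allocation:
 * false models a non-sleeping alloc that failed under memory pressure.
 */
static void toy_rcv_segment(struct toy_rcv *r, uint32_t len, bool alloc_ok)
{
    assert(r);
    r->rcv_nxt += len;

    if (!alloc_ok) {
        /* No buffer for the ACK: defer it instead of using last pages. */
        r->ack_pending = true;
        return;
    }

    /* One cumulative ACK covers this and all previously deferred data. */
    r->last_acked = r->rcv_nxt;
    r->acks_sent++;
    r->ack_pending = false;
}
```

With three 64-byte segments received while allocation fails and a fourth received after memory returns, this model sends a single ACK for 256 bytes instead of four ACKs for 64 bytes each. A real stack would also need a timer to flush a pending ACK when no further segment arrives.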
The change could check for some form of mem pressure before the alloc,
but the alloc attempt in itself should suffice. We could also do an ECN-type check before
the alloc.
Now the kicker: I want a GFP_KERNEL with NO_SLEEP, OR a GFP_ATOMIC that does
NOT use the emergency pools and thus CAN FAIL, so that there are 0 other secondary
effects and just the 1 arg changes.
code : tcp_output.c : tcp_send_ack()
line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP); /* proposed flag: NO SLEEP, no emergency pools */
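
The intended allocator behavior can be sketched in userspace C. This is a toy model under stated assumptions, not the kernel page allocator: `struct toy_pool`, `enum toy_gfp` and the `toy_*` functions are hypothetical, and the "emergency pool" is reduced to a simple page-count floor. Mainline gfp.h also defines GFP_NOWAIT, which may already be close to the requested semantics; whether it fits this use is not verified here.

```c
/* Userspace toy model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_gfp {
    TOY_GFP_ATOMIC,        /* never sleeps, may dip into the reserve      */
    TOY_GFP_KERNEL_NSLEEP  /* never sleeps, may NOT touch the reserve     */
};

struct toy_pool {
    size_t free_pages;     /* pages currently free                        */
    size_t reserve;        /* pages held back as the emergency reserve    */
};

static void toy_pool_init(struct toy_pool *p, size_t free_pages, size_t reserve)
{
    assert(reserve <= free_pages);
    p->free_pages = free_pages;
    p->reserve = reserve;
}

/* Take one page, or return false immediately; this model never sleeps. */
static bool toy_alloc_page(struct toy_pool *p, enum toy_gfp flags)
{
    /* Only TOY_GFP_ATOMIC callers may go below the reserve floor. */
    size_t floor = (flags == TOY_GFP_ATOMIC) ? 0 : p->reserve;

    if (p->free_pages <= floor)
        return false;
    p->free_pages--;
    return true;
}

/* Receive path: keeps access to the last pages. */
static bool toy_alloc_rx_skb(struct toy_pool *p)
{
    return toy_alloc_page(p, TOY_GFP_ATOMIC);
}

/* Pure-ACK transmit path: fails first when memory gets tight. */
static bool toy_alloc_ack_skb(struct toy_pool *p)
{
    return toy_alloc_page(p, TOY_GFP_KERNEL_NSLEEP);
}
```

Starting from 3 free pages with a reserve of 2, one ACK allocation succeeds and the next fails, while receive allocations can still take the remaining 2 pages. That is the priority ordering asked for: under pressure, the ACK path gives up before the input path does.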
Suggestions, feedback??
Mitchell Erblich
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html