netdev - Re: Proposed linux kernel changes : scaling tcp/ip stack : 2nd part

Message-Id: <D70D2636-7C49-48A5-B4DC-B9583A448415@earthlink.net>
Date:	Tue, 15 Jun 2010 20:30:59 -0700
From:	Mitchell Erblich <erblichs@...thlink.net>
To:	Mitchell Erblich <erblichs@...thlink.net>
Cc:	Eric Dumazet <eric.dumazet@...il.com>, netdev@...r.kernel.org
Subject: Re: Proposed linux kernel changes : scaling  tcp/ip stack : 2nd part


On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:

> 
> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> 
>> On Thursday, 3 June 2010 at 01:16 -0700, Mitchell Erblich wrote:
>>> To whom it may concern,
>>> 
>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>> developers to see if there is any consensus that the below is a logical
>>> approach. Please also pass this email on to the "owner(s)" of this stack, if
>>> any, to determine whether a case exists for the possible changes below.
>>> 
>>> I am not currently on the linux kernel mail group.
>>> 			
>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>> merged the changes into the company's local tree and left the possible 
>>> global integration to others.
>>> 
>>> I have been approached by a number of companies about scaling the
>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>> time on my hands and am considering looking into this area on my own.
>>> 
>>> The first assumption is that, if extra cores are available, a single
>>> received homogeneous flow of a large number of packets/segments per
>>> second (pps) can be split into non-equal flows. This split can in effect
>>> allow a larger recv'd pps rate at the same core load while splitting off
>>> other workloads, such as xmit'ing pure ACKs.
>>> 
>>> Simply put, again assuming Amdahl's law (and not looking to equalize the load
>>> between cores), the idea is to create logical separations so that, in a
>>> many-core system, different cores could run new kernel threads that operate
>>> in parallel within the tcp/ip stack. The initial separation points would be at
>>> the ip/tcp layer boundary and wherever any recv'd sk/pkt would generate some
>>> form of output.
>>> 
>>> The ip/tcp layer would be split like the vintage AT&T STREAMS protocol,
>>> so some form of queuing & scheduling would be needed. In addition,
>>> the queuing/scheduling of other kernel threads would occur within ip & tcp
>>> to separate the I/O.
>>> 
>>> A possible validation test is to identify the max recv'd pps rate within the
>>> tcp/ip modules for a normal flow in the TCP established state, with in-order,
>>> say 64-byte, non-fragmented segments, before and after each
>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>> 
>>> I am willing to keep a private git Linux.org tree that collects the proposed
>>> changes and, if there is willingness and a perceived want/need, then identify
>>> how to implement the merge.
>> 
>> Hi Mitchell
>> 
>> We work every day to improve the network stack, and the standard linux tree is
>> pretty scalable; you don't need to set up a separate git tree for that.
>> 
>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>> net-next-2.6 where we put all our changes.
>> 
>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>> 
>> I suggest you read the latest patches (say, about 10,000 of them) to
>> get an idea of the things we did during the last few years.
>> 
>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>> placement...
>> 
>> It's nice to see another man joining the team!
>> 
>> Thanks
>> 
> 
> 
> Let's start with a two part Linux kernel change and a tcp input/output change:
> 
> 2 Parts: 2nd part TBD
> 
> Summary: Don't use the last free pages for TCP ACKs by passing GFP_ATOMIC to our
> skbuf allocs. This is a 1-line change in tcp_output.c with a new gfp.h arg, plus
> a change in the generic kernel (TBD).
> 
> This change should have no effect with normal available kernel mem allocs.
> 
> Assuming memory pressure (WAITING for clean memory), we should be allocating
> our last pages for input skbufs and not for xmit allocs.
> 
> By delaying skbuf allocations when kmem is low, we secondarily slow down the
> tcp flow: in slow start (SS) this is almost equivalent to a DELACK; otherwise, in
> congestion avoidance (CA), it should/could decrease the number of in-flight ACKs,
> and the peer should do burst avoidance if our later ACK opens the window in a
> larger chunk.
> 
> And use the last pages to decrease the chance of dropping an input pkt or
> running out of recv descriptors because of mem back pressure.
> 
> The change could check for some form of mem pressure before the alloc,
> but the alloc in itself should suffice. We could also do an ECN-type check before
> the alloc.
> 
> Now the kicker. I want either a GFP_KERNEL with NO_SLEEP or a GFP_ATOMIC that
> does NOT use the emergency pools and thus CAN FAIL, so that there are 0 other
> secondary effects and just the 1 arg changes.
> 
> code : tcp_output.c : tcp_send_ack()
>   line : buff = alloc_skb(MAX_TCP_HEADER, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */
> 
> Suggestions, feedback??
> 
> Mitchell Erblich
> 
> 
> 
> 

Sorry :),

		2nd part:

		use GFP_NOWAIT as 2nd arg to alloc_skb()
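
To make the intent concrete, here is a rough userspace sketch, not kernel
code. It models free memory as a small page pool with a reserve below a
minimum watermark. All names in it (page_pool, pool_alloc, rx_after_acks,
MODE_ATOMIC, MODE_NOWAIT) are made up for illustration, and it assumes a
deliberately simplified rule: the ATOMIC-like mode may drain the reserve to
zero, while the NOWAIT-like mode fails once free pages reach the watermark.
The real watermark logic in the page allocator is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
enum alloc_mode {
    MODE_ATOMIC,  /* stands in for GFP_ATOMIC: may use the reserve */
    MODE_NOWAIT   /* stands in for GFP_NOWAIT: fails at the watermark */
};

struct page_pool {
    int free_pages;     /* pages currently free */
    int min_watermark;  /* pages at or below this level form the reserve */
};

/* Try to take one page without sleeping; returns false on failure. */
static bool pool_alloc(struct page_pool *pool, enum alloc_mode mode)
{
    assert(pool != NULL);
    /* Assumption: ATOMIC may drain to zero, NOWAIT stops at the watermark. */
    int floor = (mode == MODE_ATOMIC) ? 0 : pool->min_watermark;

    if (pool->free_pages <= floor)
        return false;
    pool->free_pages--;
    return true;
}

/*
 * Request `acks` ACK buffers using ack_mode, then `rx` receive buffers
 * using MODE_ATOMIC. Returns how many receive allocations succeeded.
 * The pool starts with 8 free pages and a watermark of 4 (arbitrary values).
 */
static int rx_after_acks(enum alloc_mode ack_mode, int acks, int rx)
{
    struct page_pool pool = { .free_pages = 8, .min_watermark = 4 };
    int rx_ok = 0;

    for (int i = 0; i < acks; i++)
        (void)pool_alloc(&pool, ack_mode);  /* a failed ACK is just deferred */

    for (int i = 0; i < rx; i++)
        if (pool_alloc(&pool, MODE_ATOMIC))
            rx_ok++;

    return rx_ok;
}
```

In this model, rx_after_acks(MODE_ATOMIC, 6, 6) returns 2, while
rx_after_acks(MODE_NOWAIT, 6, 6) returns 4: the ACK path gives up earlier and
the last pages stay available for input buffers.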

Mitchell Erblich
> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 