Message-Id: <01357CF3-35A2-45AC-9950-E4EBE955F225@earthlink.net>
Date: Wed, 16 Jun 2010 00:46:07 -0700
From: Mitchell Erblich <erblichs@...thlink.net>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev@...r.kernel.org
Subject: Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part
On Jun 15, 2010, at 11:37 PM, Eric Dumazet wrote:
> On Tuesday, June 15, 2010 at 23:09 -0700, Mitchell Erblich wrote:
>> On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
>>
>>>
>>> On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
>>>
>>>>
>>>> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
>>>>
>>>>> On Thursday, June 3, 2010 at 01:16 -0700, Mitchell Erblich wrote:
>>>>>> To whom it may concern,
>>>>>>
>>>>>> First, my intent is to keep this discussion local to just a few tcp/ip
>>>>>> developers, to see if there is any consensus that the below is a logical
>>>>>> approach. Please also pass this email along to the "owner(s)" of this stack,
>>>>>> if any, to identify whether a case exists for the possible changes below.
>>>>>>
>>>>>> I am not currently on the linux kernel mail group.
>>>>>>
>>>>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>>>>> merged the changes into the company's local tree and left the possible
>>>>>> global integration to others.
>>>>>>
>>>>>> I have been approached by a number of companies about scaling the
>>>>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>>>>> time on my hands and am considering looking into this area on my own.
>>>>>>
>>>>>> The first assumption is that if extra cores are available, that a single
>>>>>> received homogeneous flow of a large number of packets/segments per
>>>>>> second (pps) can be split into non-equal flows. This split can in effect
>>>>>> allow a larger recv'd pps rate at the same core load while splitting off
>>>>>> other workloads, such as xmit'ing pure ACKs.
>>>>>>
>>>>>> Simply put: again assuming Amdahl's law (and not looking to equalize the
>>>>>> load between cores), create logical separations so that in a many-core
>>>>>> system, different cores could run new kernel threads that operate in
>>>>>> parallel within the tcp/ip stack. The initial separation points would be at
>>>>>> the ip/tcp layer boundary and wherever any recv'd sk/pkt would generate
>>>>>> some form of output.
>>>>>>
>>>>>> The ip/tcp layers would be split like the vintage AT&T STREAMS modules,
>>>>>> so some form of queuing & scheduling would be needed. In addition,
>>>>>> the queuing/scheduling of other kernel threads would occur within ip & tcp
>>>>>> to separate the I/O.
>>>>>>
>>>>>> A possible validation test is to identify the max recv'd pps rate within the
>>>>>> tcp/ip modules, in the normal-flow TCP ESTABLISHED state with in-order,
>>>>>> say, 64-byte non-fragmented segments, before and after each
>>>>>> incremental change; or the same rate with fewer core/cpu cycles.
>>>>>>
>>>>>> I am willing to host a private git Linux.org tree that concentrates proposed
>>>>>> changes, and if there is willingness and a seen want/need, then identify
>>>>>> how to implement the merge.
>>>>>
>>>>> Hi Mitchell
>>>>>
>>>>> We work every day to improve the network stack, and the standard linux
>>>>> tree is pretty scalable; you don't need to set up a separate git tree for that.
>>>>>
>>>>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>>>>> net-next-2.6 where we put all our changes.
>>>>>
>>>>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>>>>>
>>>>> I suggest you read the latest patches (say... about 10,000 of them) to
>>>>> get an idea of the things we did during the last few years.
>>>>>
>>>>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>>>>> placement...
>>>>>
>>>>> It's nice to see another man joining the team!
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>> Let's start with a two-part Linux kernel change and a tcp input/output change:
>>>>
>>>> 2 Parts: 2nd part TBD
>>>>
>>>> Summary: Don't use the last free pages for TCP ACKs by using GFP_ATOMIC
>>>> for our sk buf allocs. A 1-line change in tcp_output.c with a new gfp.h arg,
>>>> and a change in the generic kernel. TBD.
>>>>
>>>> This change should have no effect with normal available kernel mem allocs.
>>>>
>>>> Assuming memory pressure (WAITING for clean memory), we should be allocating
>>>> our last pages for input skbufs and not for xmit allocs.
>>>>
>>>> By delaying skbuf allocations when we have low kmem, we secondarily slow down
>>>> the tcp flow: if in slow start (SS), we are almost doing a DELACK; else, in CA,
>>>> this should/could decrease the number of in-flight ACKs, and the peer should do
>>>> burst avoidance if our later ack increases the window in a larger chunk.
>>>>
>>>> And use the last pages to decrease the chance of dropping an input pkt or
>>>> running out of recv descriptors because of mem back-pressure.
>>>>
>>>> The change could check for some form of mem pressure before the alloc,
>>>> but the alloc in itself should suffice. We could also do an ECN-type check
>>>> before the alloc.
>>>>
>>>> Now the kicker. I want a GFP_KERNEL with NO_SLEEP, OR a GFP_ATOMIC that
>>>> does NOT use emergency pools and thus CAN FAIL, to have 0 other secondary
>>>> effects while changing just the 1 arg.
>>>>
>>>> code : tcp_output.c : tcp_send_ack()
>>>> line : buff = alloc_skb(MAX_TCP_HEADER, GFP_KERNEL_NSLEEP); /* with a NO SLEEP */
>>>>
>>>> Suggestions, feedback??
>>>>
>>>> Mitchell Erblich
>>>>
>>>>
>>>>
>>>>
>>>
>>> Sorry :),
>>>
>>> 2nd part:
>>>
>>> use GFP_NOWAIT as 2nd arg to alloc_skb()
>>>
>>> Mitchell Erblich
>>
>> Going in the same direction,
>>
>>
>> If tcp_out_of_resources() fires and the number of orphaned sockets is above
>> a configured number (maybe because of a DoS attack), SHOULD we consume
>> our last available resources? Doing so would most likely affect the skbufs of
>> the flows we aren't reset-ing, because NOW the recv sk allocs are failing.
>>
>> thus,
>> file tcp_timer.c : tcp_out_of_resources()
>> suggested change to the 2nd arg (currently GFP_ATOMIC): tcp_send_active_reset(sk, GFP_NOWAIT);
>>
>> Please note that even if we believed GFP_ATOMIC would have a higher
>> probability of sending a TCP pkt/seg, that gives us no guarantee that the peer
>> will recv it or will process it.
>>
>> We COULD also do some form of ECN in this function to inform the peer that our
>> system is in distress, if tcp_send_active_reset() did not return void and instead
>> informed us of a mem alloc failure with GFP_NOWAIT.
>>
>> Since the ECN would benefit our node/system, this ECN sending event COULD
>> be argued to have a higher priority, with its mem argument then sent as GFP_ATOMIC.
>>
>>
>> Suggestions, opinions...
>
>
> 1) Acks are about the smallest chunks ever allocated in the network
> stack.
>
> 2) Their lifetime is close to 0 us. They are not cloned (queued on a
> socket queue), only given to device xmit. Unless you play with traffic
> shaping and insane queue lengths, acks should not use more than 0.0001 %
> of your ram.
>
> 3) Under attack, adding complex algos to try to resist only delays a bit
> the moment where nothing can be done to stop the attack, clever or
> not. Dropping packets is very fine.
>
> 4) Maybe all the work you are thinking about is the balance between ATOMIC
> and non-ATOMIC (GFP_KERNEL) memory allocations? Some tuning maybe?
> The input path always uses ATOMIC allocs, being run from softirq, and
> cannot wait.
>
I am not suggesting pure GFP_KERNEL/sleeping allocs on either side.
I suggested it with a NO-SLEEP, and then looked and saw GFP_NOWAIT.
However, when under mem pressure, it might make sense to delay/slow
the opening of the snd and recv TCP windows, to allow mem pages
to be cleaned and re-used to relieve the mem pressure.
If queues already exist, then they are the first delay-points
to be reviewed/used, though only under abnormal circumstances.
Maybe even short-lived flows have expired and those resources can now
be used.
>
> --
Eric & group,
I am starting simple, with a different look at what COULD/SHOULD be done.
The question I am asking is: if GFP_NOWAIT WOULD fail AND GFP_ATOMIC
would succeed, then only the last few percent of kernel memory is left to
allocate. Do you want to spend that last few percent of memory on ACKs/xmits?
Maybe if we fail a few non-necessary/delayable items, enough time
may pass to clean pages and avoid executing any OOM-like code.
The effect of generating a short DELACK SHOULD reduce memory pressure from
the peer in 1 RTT. Also, failing and executing the no-buff code path MAY ALSO
slow down the ramp-up driven by the TCP ACK clock.
The number of ACTIVE (recv at the NIC to xmit at the NIC) tcp flows may account
for a non-negligible number of later allocs, so delaying them would decrease
mem pressure.
If I understand the diff between GFP_NOWAIT and GFP_ATOMIC, neither
sleeps, and only GFP_NOWAIT refrains from grabbing the last available pages.
So they are both atomic/no-sleep.
To conclude part 4: tcp_timer.c has 2 additional calls to tcp_send_active_reset();
use GFP_NOWAIT instead of GFP_ATOMIC there as well.
Now, to your statement that the "input path always uses atomic": SHOULD IT?
SAY under a DoS attack? Why not reject some allocs when memory is low, via
GFP_NOWAIT?
If a new flow is being started and we are under memory pressure, should it not
ALSO use GFP_NOWAIT? If we ALSO use GFP_NOWAIT on ESTABLISHED flows and (shoot me)
drop a seg/pkt, that should drop them to 1/2 bandwidth. This is, in effect, TCP
fairness. Thus, delays in ACKs are much preferred.
Again, the NEW suggested changes will only minimally affect flows, and only
when there is mem pressure and we are about to do more aggressive things,
like resetting sockets/flows.
Since my suggested changes are NOT in the form of a patch, someone ELSE
needs to agree with me, and then the maintainer must see that they do no
harm and that the changes slowly move the code in the right direction.
Mitchell Erblich
==================
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html