netdev - Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part

Message-ID: <1276670223.19249.77.camel@edumazet-laptop>
Date:	Wed, 16 Jun 2010 08:37:03 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Mitchell Erblich <erblichs@...thlink.net>
Cc:	netdev@...r.kernel.org
Subject: Re: Proposed linux kernel changes : scaling tcp/ip stack : 3rd part

On Tuesday, 15 June 2010 at 23:09 -0700, Mitchell Erblich wrote:
> On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
> 
> > 
> > On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
> > 
> >> 
> >> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> >> 
> >>> On Thursday, 03 June 2010 at 01:16 -0700, Mitchell Erblich wrote:
> >>>> To whom it may concern,
> >>>> 
> >>>> First, my assumption is to keep this discussion local to just a few tcp/ip
> >>>> developers to see if there is any consensus that the below is a logical 
> >>>> approach. Please also pass this email if there is a "owner(s)" of this stack
> >>>> to identify if a case exists for the below possible changes.
> >>>> 
> >>>> I am not currently on the linux kernel mail group.
> >>>> 			
> >>>> I have experience with modifications of the Linux tcp/ip stack, and have
> >>>> merged the changes into the company's local tree and left the possible 
> >>>> global integration to others.
> >>>> 
> >>>> I have been approached by a number of companies about scaling the
> >>>> stack with the assumption of a number of cpu cores. At present, I find extra
> >>>> time on my hands and am considering looking into this area on my own.
> >>>> 
> >>>> The first assumption is that if extra cores are available, that a single
> >>>> received homogeneous flow of a large number of packets/segments per
> >>>> second (pps) can be split into non-equal flows. This split can in effect
> >>>> allow a larger recv'd pps rate at the same core load while splitting off
> >>>> other workloads, such as xmit'ing pure ACKs.
> >>>> 
> >>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
> >>>> between cores), and creating logical separations where in a many core 
> >>>> system, different cores could have new kernel threads  that operate in 
> >>>> parallel within the tcp/ip stack. The initial separation points would be at 
> >>>> the ip/tcp layer boundary and where any recv'd sk/pkt would generate some 
> >>>> form of output.
> >>>> 
> >>>> The ip/tcp layer would be split like the vintage AT&T STREAMS protocol,
> >>>> and some form of queuing & scheduling would be needed. In addition,
> >>>> the queuing/scheduling of other kernel threads would occur within ip & tcp
> >>>> to separate the I/O.
> >>>> 
> >>>> A possible validation test is to identify the max recv'd pps rate within the
> >>>> tcp/ip modules within normal flow TCP established state with normal order 
> >>>> of say 64byte non fragmented segments, before and after each 
> >>>> incremental change. Or the same rate with fewer core/cpu cycles.
> >>>> 
> >>>> I am willing to have a private git Linux.org tree that concentrates proposed
> >>>> changes into this tree and if there is willingness, a seen want/need then identify
> >>>> how to implement the merge.
> >>> 
> >>> Hi Mitchell
> >>> 
> >>> We work every day to improve the network stack, and the standard Linux tree
> >>> is pretty scalable; you don't need to set up a separate git tree for that.
> >>> 
> >>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
> >>> net-next-2.6 where we put all our changes.
> >>> 
> >>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
> >>> 
> >>> I suggest you read the latest patches (say, about 10,000 of them) to
> >>> get an idea of the things we did during the last few years.
> >>> 
> >>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
> >>> placement...
> >>> 
> >>> It's nice to see another man joining the team!
> >>> 
> >>> Thanks
> >>> 
> >> 
> >> 
> >> Let's start with a two part Linux kernel change and a tcp input/output change:
> >> 
> >> 2 Parts: 2nd part TBD
> >> 
> >> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> >> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> >> in the generic kernel. TBD.
> >> 
> >> This change should have no effect with normal available kernel mem allocs.
> >> 
> >> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> >> our last pages for input skbufs and not for xmit allocs.
> >> 
> >> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
> >> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
> >> decrease the number of in-flight ACKs and the peer should do burst avoidance
> >> if our later ack increases the window in a larger chunk..
> >> 
> >> And use the last pages to decrease the chance of dropping an input pkt or
> >> running out of recv descriptors, because of mem back pressure.
> >> 
> >> The change could check for some form of mem pressure before the alloc,
> >> but the alloc in itself should suffice. We could also do an ECN type check before
> >> the alloc.
> >> 
> >> Now the kicker.  I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
> >> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
> >> and change just the 1 arg.
> >> 
> >> code : tcp_output.c : tcp_send_ack()
> >>  line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */
> >> 
> >> Suggestions, feedback??
> >> 
> >> Mitchell Erblich
> >> 
> >> 
> >> 
> >> 
> > 
> > Sorry :),
> > 
> > 		2nd part:
> > 
> > 		use GFP_NOWAIT as 2nd arg to alloc_skb()
> > 
> > Mitchell Erblich
> 
> Going in the same direction,
> 
> 
> If tcp_out_of_resources() and the number of orphaned sockets is above
> a configured number (maybe because of a DoS attack), SHOULD we consume
> our last available resources and most likely affect skbufs that we aren't
> resetting, because NOW the recv sk allocs are failing?
> 
> thus,
> file tcp_timer.c : tcp_out_of_resources()
> suggestion: change 2nd arg from GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);
> 
> Please note that even if we believed that the GFP_ATOMIC would have a higher 
> probability to send a TCP pkt/seg, that gives us no guarantee that the peer
> will recv it or will process it.
> 
> We COULD also do some form of ECN in this function to inform the peer that our 
> system is in distress if tcp_send_active_reset() did not return void and informed 
> us of a mem alloc failure with the GFP_NOWAIT.
> 
> Since the ECN would benefit our node/system, this ECN sending event COULD
> be argued to have a higher priority and be sent with GFP_ATOMIC as the mem argument.
> 
> 
> Suggestions, opinions...


1) Acks are about the smallest chunks that are ever allocated in the
network stack.

2) Their lifetime is close to 0 us. They are not cloned (queued on a
socket queue), only given to device xmit. Unless you play with traffic
shaping and insane queue lengths, acks should not use more than 0.0001 %
of your RAM.

3) Under attack, adding complex algos to try to resist only delays a bit
the moment when nothing can be done to stop the attack, clever or not.
Dropping packets is perfectly fine.

4) Maybe all the work you are thinking about is the balance between ATOMIC
and non-ATOMIC (GFP_KERNEL) memory allocations? Some tuning maybe?
   The input path always uses ATOMIC allocations: it runs from softirq and
cannot wait.

