Message-ID: <1349715561.21172.3463.camel@edumazet-glaptop>
Date:	Mon, 08 Oct 2012 18:59:21 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Rick Jones <rick.jones2@...com>
Cc:	Herbert Xu <herbert@...dor.apana.org.au>,
	David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>, Jesse Gross <jesse@...ira.com>
Subject: Re: [RFC] GRO scalability

On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
> On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> > On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
> >
> >> Just how much code path is there between NAPI and the socket?? (And I
> >> guess just how much combining are you hoping for?)
> >>
> >
> > When GRO correctly works, you can save about 30% of cpu cycles, it
> > depends...
> >
> > Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> > gives an improvement as well...
> 
> OK, but how much of that 30% come from where?  Each coalesced segment is 
> saving the cycles between NAPI and the socket.  Each avoided ACK is 
> saving the cycles from TCP to the bottom of the driver and a (share of) 
> transmit completion.

It comes mostly from less contention between the Bottom Half handler
and the application on the socket lock, and that is before counting all
the layers that we have to cross per packet (IP, netfilter ...)

Each time a TCP packet is delivered while the socket is owned by the
user, the packet is placed on a special 'backlog queue', and the
application has to process it right before releasing the socket lock.
It sucks because it adds latency, and other frames get queued to the
backlog while the application is processing the backlog (very expensive
because of cache line misses)

So GRO really makes this kind of event less probable.
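
A toy model of the backlog behaviour described above, as a rough sketch
only. ToySocket, softirq_deliver() and backlog_work() are hypothetical
names made up for illustration, not kernel code, and the model ignores
real locking, timing and cache effects. It only counts how many packets
the application must drain before it can release the socket lock.

```python
from collections import deque


class ToySocket:
    """Hypothetical stand-in for a TCP socket with a backlog queue."""

    def __init__(self):
        self.owned_by_user = False   # True while the application holds the lock
        self.backlog = deque()       # packets that arrived while the lock was held
        self.receive_queue = []      # packets already processed by TCP

    def lock(self):
        self.owned_by_user = True

    def softirq_deliver(self, pkt):
        # Bottom-half side: if the application owns the socket, defer the packet.
        if self.owned_by_user:
            self.backlog.append(pkt)
        else:
            self.receive_queue.append(pkt)

    def release(self):
        # Application side: the backlog must be drained before the lock is dropped.
        drained = 0
        while self.backlog:
            self.receive_queue.append(self.backlog.popleft())
            drained += 1
        self.owned_by_user = False
        return drained


def backlog_work(packets_while_locked):
    """Number of packets the application processes on release."""
    sk = ToySocket()
    sk.lock()
    for pkt in range(packets_while_locked):
        sk.softirq_deliver(pkt)
    return sk.release()
```

Under these assumptions, 12 MSS-sized segments arriving while the lock
is held mean backlog_work(12) items to drain, while the same data
coalesced by GRO into 4 packets of 3 MSS means backlog_work(4).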

> 
> When I say shuffle I mean something along the lines of interleave.  So, 
> if we have four flows, 1-4, a perfect shuffle of their segments would be 
> something like:
> 
> 1 2 3 4 1 2 3 4 1 2 3 4
> 
> but not well shuffled might look like
> 
> 1 1 3 2 3 2 4 4 4 1 3 2
> 

If all these packets are delivered in the same NAPI run, and correctly
aggregated, their order doesn't matter.

In the first case, we will deliver  B1, B2, B3, B4   (B being a GRO
packet with 3 MSS)

In the second case we will deliver

B1 B3 B2 B4
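
The two cases can be checked with a small sketch. It assumes every
segment of a flow is mergeable, that all twelve segments arrive in one
NAPI run, and that held packets are flushed in order of first arrival;
real GRO has additional limits and flush conditions that are not
modelled here. gro_napi_run() is a hypothetical helper, not a kernel
function.

```python
def gro_napi_run(segments):
    """Aggregate one NAPI run of segments per flow.

    `segments` is a sequence of flow ids, one entry per MSS-sized segment.
    Returns a list of (flow, mss_count) pairs, one per aggregated packet,
    ordered by the first appearance of each flow in the run.
    """
    held = {}  # flow id -> number of segments merged so far (insertion ordered)
    for flow in segments:
        held[flow] = held.get(flow, 0) + 1
    return list(held.items())


perfect_shuffle = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
poor_shuffle = [1, 1, 3, 2, 3, 2, 4, 4, 4, 1, 3, 2]

print(gro_napi_run(perfect_shuffle))  # [(1, 3), (2, 3), (3, 3), (4, 3)]
print(gro_napi_run(poor_shuffle))     # [(1, 3), (3, 3), (2, 3), (4, 3)]
```

Both inputs give four aggregated packets of 3 MSS each; only the order
in which the flows come out differs.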


