netdev - Re: [RFC] GRO scalability

Message-ID: <1349467578.21172.178.camel@edumazet-glaptop>
Date:	Fri, 05 Oct 2012 22:06:18 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Rick Jones <rick.jones2@...com>
Cc:	Herbert Xu <herbert@...dor.apana.org.au>,
	David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>, Jesse Gross <jesse@...ira.com>
Subject: Re: [RFC] GRO scalability

On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:

> Just how much code path is there between NAPI and the socket?? (And I 
> guess just how much combining are you hoping for?)
> 

When GRO works correctly, you can save about 30% of CPU cycles, but it
depends...

Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
gives an improvement as well...
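
To put rough numbers on the "N+1 MSS" figures above, here is a small sketch. The MSS of 1448 bytes is an assumption (a common value for a 1500-byte MTU with TCP timestamps), not something stated in this thread, and the helper name is made up for illustration.

```python
# Rough sketch: payload carried by one GRO skb for a given fragment limit.
# MSS = 1448 is an assumed value, not taken from the thread.
MSS = 1448

def gro_payload_bytes(max_frags, mss=MSS):
    # Follows the "N+1 MSS per GRO skb" count quoted in the mail.
    return (max_frags + 1) * mss

for frags in (16, 32):
    print(frags, "frags ->", gro_payload_bytes(frags), "bytes per GRO skb")
```

Under that assumption, doubling the fragment limit roughly doubles the payload handed to the stack per aggregated skb.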

> > Lets say we allow no more than 1ms of delay in GRO,
> 
> OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
> that kind of added delay anyway.  I'm not sure if that also then 
> eliminates the networked storage types.
> 

I picked this 1ms delay as an example, but I never said it was a fixed
value ;)

Also remember one thing: this is the _max_ delay, for the case where your
napi handler is flooded. This almost never happens (tm)

> > this means we could have about 400 packets in the GRO queue (assuming
> > 1500 bytes packets)
> 
> How many flows are you going to have entering via that queue?  And just 
> how well "shuffled" will the segments of those flows be?  That is what 
> it all comes down to right?  How many (active) flows and how well 
> shuffled they are.  If the flows aren't well shuffled, you can get away 
> with a smallish coalescing context.  If they are perfectly shuffled and 
> greater in number than your delay allowance you get right back to square 
> with all the overhead of GRO attempts with none of the benefit.

Not sure what you mean by shuffle. We use a hash table to locate a flow,
but we also have an LRU list to keep the packets ordered by their entry
into the 'GRO unit'.

If napi completes, the whole LRU list content is flushed to the IP stack
(napi_gro_flush()).

If napi doesn't complete, we would only flush 'too old' packets found in
the LRU.
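
A toy model of the structure described above may help. This is only a sketch under stated assumptions, not the kernel implementation: the class `GroUnit`, its methods, and the microsecond clock are hypothetical, and an `OrderedDict` stands in for the hash table plus the list ordered by entry time.

```python
from collections import OrderedDict

class GroUnit:
    """Toy model: hash lookup by flow, plus ordering by time of entry."""

    def __init__(self, max_age_us=1000):
        self.max_age_us = max_age_us
        # flow_id -> (time the flow entered the unit, list of held packets)
        self.flows = OrderedDict()

    def receive(self, flow_id, packet, now_us):
        entry = self.flows.get(flow_id)
        if entry is None:
            # New flow: appended at the tail, so the head is always oldest.
            self.flows[flow_id] = (now_us, [packet])
        else:
            # Known flow: coalesce, keeping its original position and age.
            entry[1].append(packet)

    def flush_all(self):
        """Poll completed: hand everything to the stack."""
        delivered = [(fid, pkts) for fid, (_, pkts) in self.flows.items()]
        self.flows.clear()
        return delivered

    def flush_old(self, now_us):
        """Poll did not complete: hand over only entries that are too old."""
        delivered = []
        while self.flows:
            flow_id, (first_seen, pkts) = next(iter(self.flows.items()))
            if now_us - first_seen < self.max_age_us:
                break  # everything behind the head is younger
            self.flows.popitem(last=False)
            delivered.append((flow_id, pkts))
        return delivered
```

Because entries are ordered by entry time, the selective flush can stop at the first entry that is still young enough instead of scanning every flow.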

Note: this selective flush can be called once per napi run from
net_rx_action(). The extra cost of getting a somewhat precise timestamp
would be acceptable (one call to ktime_get() or get_cycles() every 64
packets).

This timestamp could be taken once per n->poll(n, weight) call and
stored in napi->timestamp.
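
The amortization idea can be sketched as follows. This is a hedged user-space model, not kernel code: the `Napi` class, the `rx_ring` list, and the injectable `clock` are all hypothetical, and only the "one clock read per poll call, stored in a per-napi timestamp" idea comes from the text above.

```python
import time

class Napi:
    """Toy model of a poll context that reads the clock once per poll."""

    def __init__(self, clock=time.monotonic_ns):
        self.clock = clock
        self.timestamp = 0

    def poll(self, rx_ring, weight=64):
        # One clock read for the whole batch, not one per packet.
        self.timestamp = self.clock()
        batch = []
        while rx_ring and len(batch) < weight:
            # Every packet of this run shares the same coarse timestamp.
            batch.append((self.timestamp, rx_ring.pop(0)))
        return batch
```

With a weight of 64, the clock is read at most once per 64 packets, at the price of a timestamp that is only as precise as the duration of one poll run.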

> 
> If the flow count is < 400 to allow a decent shot at a non-zero 
> combining rate on well shuffled flows with the 400 packet limit, then 
> that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
> aggregated.  And I think you then get two segments per flow aggregated 
> at a time.  Is that consistent with what you expect to be the 
> characteristics of the flows entering via that queue?

If a packet can't stay more than 1ms, then a flow sending less than 1000
packets per second won't benefit from GRO.

So yes, 12.5 Mbit/s would be the threshold.
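
For reference, the arithmetic behind these figures can be checked with a short sketch. The function names are illustrative; the inputs (1500-byte packets, 1ms, 400 packets or flows, 5 Gbit/s) are the ones used in this exchange.

```python
PACKET_BYTES = 1500
WINDOWS_PER_SECOND = 1000  # a 1ms maximum delay means 1000 windows per second

def min_flow_rate_bps(packet_bytes=PACKET_BYTES, pps=WINDOWS_PER_SECOND):
    # A flow needs at least one packet per delay window to have any
    # chance of being coalesced with the next one.
    return packet_bytes * 8 * pps

def queue_fill_rate_bps(packets=400, packet_bytes=PACKET_BYTES,
                        windows_per_second=WINDOWS_PER_SECOND):
    # Aggregate rate that puts `packets` packets in the queue per window.
    return packets * packet_bytes * 8 * windows_per_second

def per_flow_rate_bps(aggregate_bps, flows):
    return aggregate_bps / flows

print(min_flow_rate_bps())                    # bits per second
print(queue_fill_rate_bps())                  # bits per second
print(per_flow_rate_bps(5_000_000_000, 400))  # bits per second
```

1000 packets per second of 1500 bytes is 12 Mbit/s, and 400 such packets per millisecond is 4.8 Gbit/s; the 12.5 Mbit/s figure is the rounded 5 Gbit/s aggregate divided by 400 flows.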

By the way, when TCP timestamps are used and the hosts are Linux machines
with HZ=1000, current GRO cannot coalesce packets anyway because their
TCP options are different.

(So it would not be useful to try a sojourn time bigger than 1ms)
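
A simplified model of why this happens, assuming the sender derives the TCP timestamp value from a tick counter that advances HZ times per second (the function names and the microsecond send times are illustrative, not from the thread):

```python
HZ = 1000  # sender tick rate, as in the case described above

def tsval(send_time_us, hz=HZ):
    # Assumed model: the timestamp value is the sender's tick count.
    return send_time_us * hz // 1_000_000

def same_tcp_options(t1_us, t2_us, hz=HZ):
    # Two segments carry identical timestamp options only if they were
    # sent within the same tick.
    return tsval(t1_us, hz) == tsval(t2_us, hz)

print(same_tcp_options(100, 900))   # both sent within one 1ms tick
print(same_tcp_options(900, 1100))  # sent across a tick boundary
```

In this model, two segments of a flow sent more than one tick apart always carry different timestamp values, so holding packets longer than 1ms cannot produce additional merges for such flows.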
