Message-ID: <1349467578.21172.178.camel@edumazet-glaptop>
Date: Fri, 05 Oct 2012 22:06:18 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Rick Jones <rick.jones2@...com>
Cc: Herbert Xu <herbert@...dor.apana.org.au>,
David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>, Jesse Gross <jesse@...ira.com>
Subject: Re: [RFC] GRO scalability
On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
> Just how much code path is there between NAPI and the socket?? (And I
> guess just how much combining are you hoping for?)
>
When GRO works correctly, you can save about 30% of CPU cycles, but it
depends...
Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
gives an improvement as well...
> > Lets say we allow no more than 1ms of delay in GRO,
>
> OK. That means we can ignore HPC and FSI because they wouldn't tolerate
> that kind of added delay anyway. I'm not sure if that also then
> eliminates the networked storage types.
>
I picked this 1ms delay, but I never said it was a fixed value ;)
Also remember one thing: this is the _max_ delay in case your napi
handler is flooded. This almost never happens (tm)
> > this means we could have about 400 packets in the GRO queue (assuming
> > 1500 bytes packets)
>
> How many flows are you going to have entering via that queue? And just
> how well "shuffled" will the segments of those flows be? That is what
> it all comes down to right? How many (active) flows and how well
> shuffled they are. If the flows aren't well shuffled, you can get away
> with a smallish coalescing context. If they are perfectly shuffled and
> greater in number than your delay allowance you get right back to square
> one with all the overhead of GRO attempts with none of the benefit.
Not sure what you mean by shuffle. We use a hash table to locate a flow,
but we also have an LRU list to keep the packets ordered by their entry
into the 'GRO unit'.
If napi completes, all the LRU list content is flushed to the IP stack
(napi_gro_flush()).
If napi doesn't complete, we would only flush 'too old' packets found in
the LRU.
Note: this selective flush can be called once per napi run from
net_rx_action(). The extra cost of getting a somewhat precise timestamp
would be acceptable (one call to ktime_get() or get_cycles() every 64
packets).
This timestamp could be stored in napi->timestamp and taken once per
n->poll(n, weight) call.
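
To make the hash table + LRU idea above concrete, here is a rough
user-space sketch in Python. It is only a toy model, not kernel code:
the GroUnit class, its method names and the max_age_us parameter are all
made up for illustration, and a single ordered dict stands in for both
the hash table and the LRU list.

```python
from collections import OrderedDict


class GroUnit:
    """Toy model of a 'GRO unit': flow lookup plus age-ordered flushing."""

    def __init__(self, max_age_us=1000):
        # Hypothetical knob: the maximum time a packet may be held (1 ms here).
        self.max_age_us = max_age_us
        # Insertion-ordered dict: lookup by flow key (the "hash table") and
        # iteration from oldest entry to newest (the "LRU list").
        self.flows = OrderedDict()  # flow_key -> (entry_time_us, [segments])

    def receive(self, flow_key, segment, now_us):
        """Merge a segment into its held flow, or start holding a new one."""
        if flow_key in self.flows:
            # Merging keeps the original entry time and list position.
            self.flows[flow_key][1].append(segment)
        else:
            self.flows[flow_key] = (now_us, [segment])

    def flush_all(self):
        """napi completed: hand everything to the stack, oldest first."""
        out = [(key, segs) for key, (_, segs) in self.flows.items()]
        self.flows.clear()
        return out

    def flush_old(self, now_us):
        """napi did not complete: flush only entries held for max_age_us or more."""
        out = []
        while self.flows:
            key, (entry_time_us, segs) = next(iter(self.flows.items()))
            if now_us - entry_time_us < self.max_age_us:
                break  # ordered by entry time, so everything after is younger
            self.flows.popitem(last=False)
            out.append((key, segs))
        return out
```

In this model flush_old() would be called with the timestamp taken once
per poll, and flush_all() plays the role of napi_gro_flush(). Because
the list is ordered by entry time, the selective flush stops at the
first entry that is still young enough.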
>
> If the flow count is < 400 to allow a decent shot at a non-zero
> combining rate on well shuffled flows with the 400 packet limit, then
> that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s
> aggregated. And I think you then get two segments per flow aggregated
> at a time. Is that consistent with what you expect to be the
> characteristics of the flows entering via that queue?
If a packet can't stay more than 1ms, then a flow sending less than 1000
packets per second won't benefit from GRO.
So yes, 12.5 Mbit/s would be the threshold.
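
The arithmetic behind these numbers can be checked with a small Python
sketch. The function names are made up for illustration; the inputs
(1500-byte packets, 1 ms max delay, 400 packets, 5 Gbit/s) are the ones
used in this thread. Integer math is used to avoid rounding noise.

```python
PACKET_BYTES = 1500       # packet size assumed in the thread
MAX_DELAY_US = 1000       # 1 ms maximum sojourn time in GRO
QUEUE_PACKETS = 400       # packets allowed in the GRO queue


def aggregate_rate_bps(packets, packet_bytes, window_us):
    """Line rate needed to deliver `packets` packets within `window_us`."""
    return packets * packet_bytes * 8 * 1_000_000 // window_us


def min_flow_rate_bps(packet_bytes, max_delay_us):
    """Rate of a flow that sends exactly one packet per max-delay window.

    A slower flow has its held packet flushed before the next one arrives,
    so nothing can be merged.
    """
    return packet_bytes * 8 * 1_000_000 // max_delay_us


def per_flow_share_bps(aggregate_bps, flows):
    """Average per-flow rate when `flows` flows share the aggregate equally."""
    return aggregate_bps // flows


print(aggregate_rate_bps(QUEUE_PACKETS, PACKET_BYTES, MAX_DELAY_US))  # 4800000000
print(min_flow_rate_bps(PACKET_BYTES, MAX_DELAY_US))                  # 12000000
print(per_flow_share_bps(5_000_000_000, 400))                         # 12500000
```

So 400 packets per millisecond is about 4.8 Gbit/s (rounded to 5 Gbit/s
above), and 1000 packets per second of 1500 bytes is 12 Mbit/s, which
matches the 12.5 Mbit/s figure obtained by dividing 5 Gbit/s across 400
flows.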
By the way, when TCP timestamps are used, and the hosts are Linux
machines with HZ=1000, current GRO cannot coalesce such packets anyway
because their TCP options are different.
(So it would not be useful to try a sojourn time bigger than 1ms.)
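
A simplified Python model of that last point, under two stated
assumptions: the sender's TSval is a plain tick counter running at HZ,
and GRO only merges segments whose timestamp option is identical (the
echoed TSecr is ignored here). The helper names are hypothetical.

```python
def tsval(send_time_us, hz=1000):
    """TSval a sender would stamp at send_time_us, assuming a tick counter at hz."""
    return send_time_us * hz // 1_000_000


def options_match(t1_us, t2_us, hz=1000):
    """True if two segments sent at t1_us and t2_us carry the same TSval."""
    return tsval(t1_us, hz) == tsval(t2_us, hz)


print(options_match(100, 900))           # True: both sent within tick 0
print(options_match(900, 1100))          # False: tick 0 versus tick 1
print(options_match(0, 1000))            # False: a full 1 ms apart
print(options_match(0, 9999, hz=100))    # True: a 10 ms tick spans both
```

With HZ=1000, two segments sent 1 ms or more apart always land in
different ticks in this model, so holding a packet longer than 1ms
cannot produce a merge.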