Message-ID: <20180713130840.1b6b78ea@redhat.com>
Date: Fri, 13 Jul 2018 13:08:40 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Or Gerlitz <gerlitz.or@...il.com>
Cc: Edward Cree <ecree@...arflare.com>,
Saeed Mahameed <saeedm@...lanox.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
brouer@...hat.com
Subject: Re: [net-next PATCH] net: ipv4: fix listify ip_rcv_finish in case
of forwarding
On Thu, 12 Jul 2018 23:10:28 +0300 Or Gerlitz <gerlitz.or@...il.com> wrote:
> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> <brouer@...hat.com> wrote:
>
> > Well, I would prefer you to implement those. I just did a quick
> > implementation (it's trivially easy) so I have something to benchmark
> > with. The performance boost is quite impressive!
>
> sounds good, but wait
>
>
> > One reason I didn't "just" send a patch is that Edward has so far only
> > implemented netif_receive_skb_list() and not napi_gro_receive_list().
>
> sfc doesn't support gro?! doesn't make sense.. Edward?
>
> > And your driver uses napi_gro_receive(). This sort-of disables GRO for
> > your driver, which is not a choice I can make. Interestingly I get
> > around the same netperf TCP_STREAM performance.
>
> Same TCP performance
I said around the same... I'll redo the benchmarks and verify...
(did it.. see later).
> with GRO and no rx-batching
>
> or
>
> without GRO and yes rx-batching
Yes, obviously without GRO and yes rx-batching.
> is by far not an intuitive result to me unless both these techniques
> mostly serve to eliminate lots of instruction cache misses and the
> TCP stack is so much optimized that if the code is in the cache,
> going through it once with a 64K byte GRO-ed packet is like going
> through it ~40 (64K/1500) times with non-GRO-ed packets.
Actually, the GRO code path is rather expensive, and uses a lot of
indirect calls. If you have a UDP workload, then disabling GRO will
give you a 10-15% performance boost.
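(For reference, GRO can be toggled per device at runtime with e.g.:
 $ ethtool -K <ifname> gro off
where <ifname> is just a placeholder for the interface name.)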
Edward's changes are basically a generalized version of GRO, up to the
IP layer (ip_rcv). So, for me it makes perfect sense.
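(Conceptually: instead of calling into the stack once per packet, the
driver queues the skbs on a list and hands the whole bundle over in one
call. A rough, untested sketch of the pattern -- rx_poll_one() is a
made-up placeholder for the driver's per-descriptor work, the rest is
the upstream API from Edward's series:)

/* Rough, untested sketch of the listified RX pattern (driver-agnostic).
 * rx_poll_one() is a made-up placeholder; netif_receive_skb_list() and
 * skb->list are the upstream API.
 */
#include <linux/list.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *rx_poll_one(struct napi_struct *napi); /* hypothetical */

static int rx_poll_listified(struct napi_struct *napi, int budget)
{
        LIST_HEAD(rx_list);
        struct sk_buff *skb;
        int work = 0;

        while (work < budget && (skb = rx_poll_one(napi)) != NULL) {
                /* queue instead of per-packet netif_receive_skb() */
                list_add_tail(&skb->list, &rx_list);
                work++;
        }

        /* one call into the stack for the whole bundle; with the full
         * series each layer up to ip_rcv sees the list once instead of
         * being entered once per packet */
        if (!list_empty(&rx_list))
                netif_receive_skb_list(&rx_list);

        return work;
}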
> What's the baseline (with GRO and no rx-batching) number on your setup?
Okay, redoing the benchmarks...
Implemented a code hack so I can control at runtime whether the mlx5
driver uses napi_gro_receive() or netif_receive_skb_list() (abusing a
netdev ethtool-controlled feature flag that is not in use).
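The hack itself is nothing more than a branch on that feature bit in the
RX delivery path, roughly like this untested sketch (the flag choice and
the function/parameter names are illustrative only, not the actual mlx5e
code):

/* Untested sketch; NETIF_F_FCOE_MTU is only picked as an example of a
 * feature bit the driver does not otherwise use -- any spare
 * ethtool-controllable bit would do.
 */
#include <linux/list.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void rx_deliver(struct napi_struct *napi, struct net_device *dev,
                       struct list_head *rx_list, struct sk_buff *skb)
{
        if (dev->features & NETIF_F_FCOE_MTU)
                /* batched path: the list is flushed once per NAPI poll
                 * with netif_receive_skb_list(), as in the sketch above */
                list_add_tail(&skb->list, rx_list);
        else
                /* default path: per-packet GRO as before */
                napi_gro_receive(napi, skb);
}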
To get a quick test going with feedback every 3 sec I use:
$ netperf -t TCP_STREAM -H 198.18.1.1 -D3 -l 60000 -T 4,4
Default: using napi_gro_receive() with GRO enabled:
  Interim result: 25995.28 10^6bits/s over 3.000 seconds

Disable GRO but still use napi_gro_receive():
  Interim result: 21980.45 10^6bits/s over 3.001 seconds

Make driver use netif_receive_skb_list():
  Interim result: 25490.67 10^6bits/s over 3.002 seconds
As you can see, using netif_receive_skb_list() gives a huge performance
boost over disabled GRO. And it comes very close to the performance of
enabled GRO, which is rather impressive! :-)
Notice, even more impressively: these tests are without CONFIG_RETPOLINE.
We primarily merged netif_receive_skb_list() due to the overhead of
RETPOLINEs, but we see a benefit even when not using RETPOLINEs.
> > I assume we can get even better perf if we "listify" napi_gro_receive.
>
> yeah, that would be very interesting to get there
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer