Message-ID: <20190213105212.GY3087@gauss3.secunet.de>
Date: Wed, 13 Feb 2019 11:52:12 +0100
From: Steffen Klassert <steffen.klassert@...unet.com>
To: Paolo Abeni <pabeni@...hat.com>
CC: Willem de Bruijn <willemb@...gle.com>,
"Jason A. Donenfeld" <Jason@...c4.com>, <netdev@...r.kernel.org>
Subject: Re: [PATCH RFC v2 1/3] UDP: enable GRO by default.
Moving the discussion back to the list...
On Wed, Feb 13, 2019 at 10:50:53AM +0100, Paolo Abeni wrote:
> Hi,
>
> I'm unsure if the off-list reply was intentional; anyhow, I'll keep this
> reply off-list. Please feel free to move the discussion back to the ML
> as it fits you better.
Your last mail was already off-list; I've just replied to everyone in Cc :-)
>
> On Wed, 2019-02-13 at 10:04 +0100, Steffen Klassert wrote:
> > On Tue, Feb 12, 2019 at 05:18:38PM +0100, Paolo Abeni wrote:
> > > Hi,
> > >
> > > On Mon, 2019-01-28 at 09:50 +0100, Steffen Klassert wrote:
> > > > diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
> > > > index 83b11d0ac091..5f7937a4f71a 100644
> > > > --- a/net/ipv6/udp_offload.c
> > > > +++ b/net/ipv6/udp_offload.c
> > > > @@ -119,6 +119,8 @@ INDIRECT_CALLABLE_SCOPE
> > > > struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
> > > > {
> > > > struct udphdr *uh = udp_gro_udphdr(skb);
> > > > + struct sk_buff *pp;
> > > > + struct sock *sk;
> > > >
> > > > if (unlikely(!uh) || !static_branch_unlikely(&udpv6_encap_needed_key))
> > > > goto flush;
> > >
> > > This ^^^^
> > > should be dropped, otherwise GRO is not enabled for ipv6.
> >
> > Yes, will change this in the next patchset.
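I.e. the next version would simply drop the static key check from the
condition, something like this (untested sketch):

--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ ... @@ udp6_gro_receive
 	struct udphdr *uh = udp_gro_udphdr(skb);
 	struct sk_buff *pp;
 	struct sock *sk;
 
-	if (unlikely(!uh) || !static_branch_unlikely(&udpv6_encap_needed_key))
+	if (unlikely(!uh))
 		goto flush;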
> >
> > > I finally had some time to benchmark this series.
> > >
> > > I used a pktgen sender b2b connected with the DUT via a 10Gb link. The
> > > sender generates 64-byte UDP packets, and the receiver runs a UDP
> > > sink _without_ UDP_GRO.
> > >
> > > With aggregation (pktgen uses a single flow):
> > >
> > >              vanilla    patched    delta
> > >               (kpps)     (kpps)     (%)
> > > sink with      2000       2050       2.5
> > > recvmsg
> > >
> > > sink with      1260       2900       130
> > > recvmmsg
> >
> > Do you know the bottleneck with the patched kernel? In particular
> > with recvmsg, I guess the bottleneck is a full socket receive queue.
>
> Yes, with the patched kernel the bottleneck is the user-space sink:
> especially due to PTI, the user/kernel transition is very costly. This
> is why recvmmsg gives such a great improvement.
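(For reference, a recvmmsg()-based sink looks roughly like the sketch
below. This is just an illustration, not the actual test program; the
batch size and port number are arbitrary choices.)

/* minimal recvmmsg() UDP sink: one user/kernel transition can drain
 * up to BATCH datagrams, which is what amortizes the PTI-inflated
 * transition cost compared to plain recvmsg()
 */
#define _GNU_SOURCE		/* for recvmmsg() */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define BATCH	64
#define BUFSZ	2048

int main(void)
{
	static char bufs[BATCH][BUFSZ];
	struct mmsghdr msgs[BATCH];
	struct iovec iovecs[BATCH];
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd, i;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iovecs[i].iov_base = bufs[i];
		iovecs[i].iov_len = BUFSZ;
		msgs[i].msg_hdr.msg_iov = &iovecs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	for (;;) {
		if (recvmmsg(fd, msgs, BATCH, 0, NULL) < 0) {
			perror("recvmmsg");
			return 1;
		}
		/* datagrams are simply discarded */
	}
}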
>
> IIRC the bottleneck is still in user space even with recvmmsg, because
> GRO (even with segmentation) gives a great speedup vs. non-GRO
> processing.
>
> Additional note: all the above is without any nf/ct-related module
> loaded - that is, under the least favorable conditions for GRO.
>
> > > Note: with the unpatched kernel, the bottleneck is in ksoftirq
> > > processing: making the receiver faster (with recvmmsg) reduces the
> > > "goodput", as the kernel processing needs to spend more time waking
> > > up the user-space process.
> > >
> > > When aggregation does not take place (pktgen changes the source port
> > > for each packet, creating 64K different flows):
> >
> > I tried this with random source ports on my forwarding setup.
> > Here the NIC distributes the packets to different RX queues,
> > so I could easily fill a 40G link with this workload.
>
> I'm sorry, I omitted another piece of information: in my tests I forced
> the NIC to use a single queue (and 64-byte packets). Otherwise the link
> speed (or the bus bandwidth) easily becomes the bottleneck and I could
> not measure any difference. I have only 10Gbps links handy.
Ok, so this is likely the worst case we can get.
>
> > >              vanilla    patched    delta
> > >               (kpps)     (kpps)     (%)
> > > sink with      2020       1620       -20
> > > recvmsg
> > >
> > > sink with      1260       1110       -12
> > > recvmmsg
> > >
> > > Here there is a measurable regression. I haven't tracked it fully, but
> > > it looks like it depends mostly on less cache-friendly execution: 64
> > > packets are stored in the GRO hash and then flushed altogether. When
> > > that happens, most packet headers are cold in the cache.
> >
> > With 64 packets in the gro hashtable, an L1D cache miss is rather
> > likely. BTW, why can we hold 64 flows in the gro hashtable now?
> > The old gro_list could hold up to 8 flows. Seems like
> >
> > commit 6312fe77751f57d4fa2b28abeef84c6a95c28136
> > net: limit each hash list length to MAX_GRO_SKBS
> >
> > changed this.
>
> AFAICS, yes, each list can have at most 8 entries, but the hash has
> GRO_HASH_BUCKETS (8) buckets/lists, for a possible total of 64 flows.
> Likely we will hit the MAX_GRO_SKBS limit before completely filling the
> hash. I have to instrument to get some real numbers, but likely at each
> napi_poll() completion we will still have a lot of flows there (almost
> 64).
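For reference, the two limits involved (values as I read them in
current sources, worth double-checking against the exact tree):

/* include/linux/netdevice.h */
#define GRO_HASH_BUCKETS	8

/* net/core/dev.c */
#define MAX_GRO_SKBS		8

/* so up to GRO_HASH_BUCKETS * MAX_GRO_SKBS == 64 flows in flight */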
>
> I'm using HZ == 1000, so I hit a napi_gro_flush() at each napi_poll().
>
> > > I'm wondering if a prefetch(skb->data) in __napi_gro_flush_chain()
> > > could help?!? (even for TCP workload)
> >
> > Yes, the problem should exist for TCP too. Not sure if a prefetch
> > will help, but you can try.
>
> Indeed, I would be happy for any better solution!
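FWIW, the experiment itself is simple enough. Untested sketch against
net/core/dev.c (the prefetch distance likely needs tuning; prefetching
right before use may come too late to hide the miss):

#include <linux/prefetch.h>

static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
				   bool flush_old)
{
	struct list_head *head = &napi->gro_hash[index].list;
	struct sk_buff *skb, *p;

	list_for_each_entry_safe_reverse(skb, p, head, list) {
		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
			return;
		/* warm the headers before napi_gro_complete() walks them */
		prefetch(skb->data);
		skb_list_del_init(skb);
		napi_gro_complete(skb);
		napi->gro_hash[index].count--;
	}
}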
Alternatively, maybe we could adjust napi_gro_complete() etc. so that
we can use netif_receive_skb_list() on a gro flush. The cache would
still be cold, but we would not need to run the full stack for each
packet in the gro hashtable.
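Roughly like this (untested sketch; gro_complete_prepare() is a
hypothetical helper that would have to be split out of
napi_gro_complete(), running the ->gro_complete() callbacks without
delivering the skb):

static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
				   bool flush_old)
{
	struct list_head *head = &napi->gro_hash[index].list;
	struct sk_buff *skb, *p;
	LIST_HEAD(done);

	list_for_each_entry_safe_reverse(skb, p, head, list) {
		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
			break;	/* still deliver what we collected */
		skb_list_del_init(skb);
		gro_complete_prepare(skb);	/* hypothetical, see above */
		list_add_tail(&skb->list, &done);
		napi->gro_hash[index].count--;
	}

	/* one batched pass through the stack instead of per-skb calls */
	if (!list_empty(&done))
		netif_receive_skb_list(&done);
}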
>
> The following is likely overkill, but... long time ago, I tried to
> implement early demux for unconnected sockets:
>
> https://www.spinics.net/lists/netdev/msg456384.html
>
> One idea behind that series is that, if there is no policy routing
> configuration in place, we can detect whether a packet is locally
> terminated without any route lookup, with a way cheaper address lookup.
> *Perhaps* a similar idea could be used here -> in the absence of
> UDP_GRO sockets, do UDP GRO only if the packet is not locally
> terminated (using an address lookup).
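If I understand the idea correctly, it would be roughly (hypothetical
sketch; neither no_udp_gro_sockets() nor inet_addr_is_local_fast()
exists upstream, the latter standing in for the cheap address lookup
from that series):

	/* in udp_gro_receive(), before starting aggregation */
	if (no_udp_gro_sockets() &&
	    inet_addr_is_local_fast(dev_net(skb->dev), ip_hdr(skb)->daddr))
		goto flush;	/* locally terminated: skip GRO */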
That would not help TCP, and we cannot use listified GRO for the
local UDP input path (still not sure if we want to).