Message-ID: <1506371169.2614.3.camel@redhat.com>
Date: Mon, 25 Sep 2017 22:26:09 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
Pablo Neira Ayuso <pablo@...filter.org>,
Florian Westphal <fw@...len.de>,
Eric Dumazet <edumazet@...gle.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>
Subject: Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactor the UDP early demux code so that:
> >
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> >
> > To perform this tasks a couple of facilities are added:
> >
> > * noref socket references, scoped inside the current RCU section, to be
> > explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> > related local dst entry
> >
> > The measured performance gain under small packet UDP flood is as follow:
> >
> > ingress NIC vanilla patched delta
> > rx queues (kpps) (kpps) (%)
> > [ipv4]
> > 1 2177 2414 10
> > 2 2527 2892 14
> > 3 3050 3733 22
>
>
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
>
> return socket[RX_QUEUE# | or CPU#];
>
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
>
> return socket[hash(skb)];
>
> Multiple cpus can then :
> - compete on grabbing same socket refcount
> - compete on grabbing the receive queue lock
> - compete for releasing lock and socket refcount
> - skb freeing done on different cpus than where allocated.
>
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
>
> First solve the false sharing issue.
>
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
>
> Then we can see if the gains you claim are still applicable.
Here are the performance results using a BPF filter that steers each
ingress packet to the reuseport socket whose index matches the ingress
CPU, so we have a 1:1 mapping between the ingress receive queues and the
destination sockets:
ingress NIC     vanilla   patched   delta
rx queues       (kpps)    (kpps)    (%)
[ipv4]
2               3020      3663      21
3               4352      5179      19
4               5318      6194      16
5               6258      7583      21
6               7376      8558      16
[ipv6]
2               2446      3949      61
3               3099      5092      64
4               3698      6611      78
5               4382      7852      79
6               5116      8851      73
Some notes:
- figures obtained with:

    ethtool -L em2 combined $n
    MASK=1
    for I in `seq 0 $((n - 1))`; do
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        taskset -p $((MASK << ($I + $n) )) $!
    done

- in the IPv6 routing code we currently have a significant bottleneck in
ip6_pol_route(): I see a lot of contention on a dst refcount, so
without early demux the performance does not scale well there.
- For maximum performance the BH and the user space sink need to run on
different CPUs. Yes, we get some more cacheline misses and a little
contention on the receive queue spin lock, but we also get a lot fewer
icache misses and more CPU cycles available, so the overall tput is a lot
higher than when binding the sink to the same CPU where the BH is running.
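
The pinning step in the notes above places sink number I on CPU I + n, away from the CPUs serving the n rx queues. The taskset(1) man page describes the mask argument as hexadecimal, so a decimal value produced by shell arithmetic only matches the intended mask for small shifts. A sketch of a helper that prints the mask in hex, under the assumption (not stated in the thread) that the BH for queue i runs on CPU i; sink_cpu and cpu_mask_hex are hypothetical names.

```c
#include <stdio.h>

/* CPU chosen for sink number i when n rx queues are configured,
 * mirroring the (I + n) shift in the script. */
int sink_cpu(int i, int n)
{
    return i + n;
}

/* Write the single-CPU affinity mask for `cpu` into buf as a hex string
 * suitable for `taskset -p`. Returns 0 on success, -1 if cpu is outside
 * 0..63 or the buffer is too small. */
int cpu_mask_hex(int cpu, char *buf, size_t len)
{
    int written;

    if (cpu < 0 || cpu >= 64)
        return -1;

    written = snprintf(buf, len, "0x%llx", 1ULL << cpu);
    return (written < 0 || (size_t)written >= len) ? -1 : 0;
}
```

For example, with n = 2 and I = 2 the sink CPU is 4 and the mask string is 0x10, whereas $((1 << 4)) expands to 16.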
> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.
Interesting, looking forward to that!
Cheers,
Paolo