Date:   Mon, 28 Nov 2016 14:52:41 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Hannes Frederic Sowa <hannes@...essinduktion.org>,
        Sabrina Dubroca <sd@...asysnail.net>, brouer@...hat.com
Subject: Re: [PATCH net-next 0/5] net: add protocol level recvmmsg support


On Mon, 28 Nov 2016 13:21:41 +0100 Jesper Dangaard Brouer <brouer@...hat.com> wrote:
> On Mon, 28 Nov 2016 11:52:38 +0100 Paolo Abeni <pabeni@...hat.com> wrote:
> >   
> > > > [2] like [1], but using the minimum number of flows to saturate the user space
> > > >  sink, that is 1 flow for the old kernel and 3 for the patched one.
> > > >  the tput increases since the contention on the rx lock is low.
> > > > [3] like [1] but using a single flow with both old and new kernel. All the
> > > >  packets land on the same rx queue and there is a single ksoftirqd instance
> > > >  running    
[...]
> > 
> > We also used a connected socket for test [3], with relatively little
> > difference (the tput increased for both the unpatched and the patched
> > kernel, and the difference was roughly the same).
> 
> When I use connected sockets (RX side) with ip_early_demux enabled, I
> do see a performance boost for recvmmsg.  Setup: these patches
> applied, ksoftirqd forced onto CPU0, udp_sink on CPU2, and pktgen
> sending a single flow of 1472-byte packets.
> 
> $ sysctl net/ipv4/ip_early_demux
> net.ipv4.ip_early_demux = 1
> 
> $ grep -H . /proc/sys/net/core/{r,w}mem_max
> /proc/sys/net/core/rmem_max:1048576
> /proc/sys/net/core/wmem_max:1048576
> 
> # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1
> #                               ns      pps             cycles
> recvMmsg/32  	run: 0 10000000	462.51	2162095.23	1853
> recvmsg   	run: 0 10000000	536.47	1864041.75	2150
> read      	run: 0 10000000	492.01	2032460.71	1972
> recvfrom  	run: 0 10000000	553.94	1805262.84	2220
> 
> # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1 --connect
> #                               ns      pps             cycles
> recvMmsg/32  	run: 0 10000000	405.15	2468225.03	1623
> recvmsg   	run: 0 10000000	548.23	1824049.58	2197
> read      	run: 0 10000000	489.76	2041825.27	1962
> recvfrom  	run: 0 10000000	466.18	2145091.77	1868
> 
> My theory is that with a connected RX socket, ksoftirqd gets faster
> (no fib_lookup) and is no longer the bottleneck.  This is confirmed
> by nstat.
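
For reference, the "connected RX socket" part just means the sink calls
connect() on its receiving UDP socket once it knows the sender, so that
(with net.ipv4.ip_early_demux=1) the socket/dst lookup can be cached
instead of being redone per packet.  A minimal sketch, illustrative
only and not the actual udp_sink source:

  /* udp-sink sketch: bind, learn the peer, then connect() the RX socket */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[2048];
      struct sockaddr_in local = {
          .sin_family = AF_INET,
          .sin_port   = htons(9),                    /* matches --port 9 */
          .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
      };
      struct sockaddr_in peer;
      socklen_t peerlen = sizeof(peer);
      int fd = socket(AF_INET, SOCK_DGRAM, 0);

      if (fd < 0 || bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
          return 1;

      /* Learn the sender address from the first datagram ... */
      if (recvfrom(fd, buf, sizeof(buf), 0,
                   (struct sockaddr *)&peer, &peerlen) < 0)
          return 1;

      /* ... and connect the RX socket to it.  With ip_early_demux
       * enabled, subsequent packets hit the connected-socket fast path
       * instead of a per-packet fib_lookup in ksoftirqd. */
      if (connect(fd, (struct sockaddr *)&peer, peerlen) < 0)
          return 1;

      while (recv(fd, buf, sizeof(buf), 0) > 0)
          ;   /* drop the payload, we only count packets */

      close(fd);
      return 0;
  }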

Paolo asked me to do a test with small packets with pktgen, and I was
actually surprised by the result.

# taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1 --connect
recvMmsg/32  	run: 0 10000000	426.61	2344076.59	1709	17098657328
recvmsg   	run: 0 10000000	533.49	1874449.82	2138	21382574965
read      	run: 0 10000000	470.22	2126651.13	1884	18846797802
recvfrom  	run: 0 10000000	513.74	1946499.83	2059	20591095477

Notice how recvMmsg/32 got slower, by 124 kpps (2468225 pps -> 2344076 pps).
I was expecting it to get faster: we just established that udp_sink was
the bottleneck, and smaller packets should mean fewer bytes copied to
userspace (copy_user_enhanced_fast_string).  (With nstat I observe that
ksoftirqd is again the bottleneck.)
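
For context, the recvMmsg/32 test case above is a batched receive via
recvmmsg(2), pulling up to 32 datagrams out of the socket per syscall.
A rough sketch of such a loop (illustrative only, not the udp_sink
source; BATCH/PKT_MAX are made-up names):

  #define _GNU_SOURCE              /* for recvmmsg() */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  #define BATCH   32               /* messages per recvmmsg() call */
  #define PKT_MAX 2048             /* per-datagram buffer size */

  /* Receive 'count' datagrams from fd, up to BATCH at a time. */
  static long sink_recvmmsg(int fd, unsigned long count)
  {
      static char bufs[BATCH][PKT_MAX];
      struct mmsghdr msgs[BATCH];
      struct iovec iovs[BATCH];
      unsigned long received = 0;
      int i, n;

      memset(msgs, 0, sizeof(msgs));
      for (i = 0; i < BATCH; i++) {
          iovs[i].iov_base = bufs[i];
          iovs[i].iov_len  = PKT_MAX;
          msgs[i].msg_hdr.msg_iov    = &iovs[i];
          msgs[i].msg_hdr.msg_iovlen = 1;
      }

      while (received < count) {
          /* One syscall can deliver up to BATCH datagrams */
          n = recvmmsg(fd, msgs, BATCH, 0, NULL);
          if (n < 0)
              return -1;
          received += n;
      }
      return received;
  }

The batching mainly buys fewer syscalls and fewer lock/unlock rounds on
the receive queue per datagram, which is presumably why it only pays
off when user space, not ksoftirqd, is the bottleneck.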

Looking at the perf diff for CPU2 (baseline = the 64-byte run) we do see
an increase in copy_user_enhanced_fast_string.  More interestingly, we
see a decrease in the locking cost when using big packets (see ** below).
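
(To read the table: Baseline is the symbol's share of cycles in the
64-byte run and Delta is the change in the 1472-byte run.  Assuming the
usual perf-diff semantics, copy_user_enhanced_fast_string goes from
~12% to roughly 34%, while udp_rmem_release and _raw_spin_lock_bh drop
to well under 1%.)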

# Event 'cycles:ppp'
#
# Baseline    Delta  Shared Object     Symbol                                   
# ........  .......  ................  .........................................
#
    15.09%   +0.33%  [kernel.vmlinux]  [k] copy_msghdr_from_user
    12.36%  +21.89%  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     8.65%   -0.63%  [kernel.vmlinux]  [k] udp_process_skb
     7.33%   -1.88%  [kernel.vmlinux]  [k] __skb_try_recv_datagram_batch
 **  7.12%   -6.66%  [kernel.vmlinux]  [k] udp_rmem_release **
 **  6.71%   -6.52%  [kernel.vmlinux]  [k] _raw_spin_lock_bh **
     6.35%   +1.36%  [kernel.vmlinux]  [k] __free_page_frag
     4.39%   +0.29%  [kernel.vmlinux]  [k] copy_msghdr_to_user_gen
     2.87%   -1.52%  [kernel.vmlinux]  [k] skb_release_data
     2.60%   +0.14%  [kernel.vmlinux]  [k] __put_user_4
     2.27%   -2.18%  [kernel.vmlinux]  [k] __sk_mem_reduce_allocated
     2.11%   +0.08%  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.68
     1.90%   +2.40%  [kernel.vmlinux]  [k] __slab_free
     1.73%   +0.20%  [kernel.vmlinux]  [k] __udp_recvmmsg
     1.62%   -1.62%  [kernel.vmlinux]  [k] intel_idle
     1.52%   +0.22%  [kernel.vmlinux]  [k] copy_to_iter
     1.20%   -0.03%  [kernel.vmlinux]  [k] import_iovec
     1.14%   +0.05%  [kernel.vmlinux]  [k] rw_copy_check_uvector
     0.80%   -0.04%  [kernel.vmlinux]  [k] recvmmsg_ctx_to_user
     0.75%   -0.69%  [kernel.vmlinux]  [k] __local_bh_enable_ip
     0.71%   +0.18%  [kernel.vmlinux]  [k] skb_copy_datagram_iter
     0.70%   -0.07%  [kernel.vmlinux]  [k] recvmmsg_ctx_from_user
     0.67%   +0.08%  [kernel.vmlinux]  [k] kmem_cache_free
     0.56%   +0.42%  [kernel.vmlinux]  [k] udp_process_msg
     0.48%   +0.05%  [kernel.vmlinux]  [k] skb_release_head_state
     0.46%           [kernel.vmlinux]  [k] lapic_next_deadline
     0.36%           [kernel.vmlinux]  [k] __switch_to
     0.34%   -0.03%  [kernel.vmlinux]  [k] consume_skb
     0.32%   -0.05%  [kernel.vmlinux]  [k] skb_consume_udp


The perf diff from CPU0 also shows less lock congestion:

# Event 'cycles:ppp'
#
# Baseline    Delta  Shared Object     Symbol                                   
# ........  .......  ................  .........................................
#
    11.04%   -3.02%  [kernel.vmlinux]  [k] __udp_enqueue_schedule_skb
     9.98%   +2.16%  [mlx5_core]       [k] mlx5e_handle_rx_cqe
     7.23%   -1.85%  [kernel.vmlinux]  [k] udp_v4_early_demux
     3.90%   +0.73%  [kernel.vmlinux]  [k] build_skb
     3.85%   -1.77%  [kernel.vmlinux]  [k] udp_queue_rcv_skb
     3.83%   +0.02%  [kernel.vmlinux]  [k] sock_def_readable
 **  3.26%   -3.19%  [kernel.vmlinux]  [k] queued_spin_lock_slowpath **
     2.99%   +0.55%  [kernel.vmlinux]  [k] __build_skb
     2.97%   +0.11%  [kernel.vmlinux]  [k] __udp4_lib_rcv
 **  2.87%   -1.39%  [kernel.vmlinux]  [k] _raw_spin_lock **
     2.67%   +0.60%  [kernel.vmlinux]  [k] ip_rcv
     2.65%   +0.61%  [kernel.vmlinux]  [k] __netif_receive_skb_core
     2.64%   +0.79%  [ip_tables]       [k] ipt_do_table
     2.37%   +0.37%  [kernel.vmlinux]  [k] read_tsc
     2.26%   +0.52%  [mlx5_core]       [k] mlx5e_get_cqe
     2.11%   -1.15%  [kernel.vmlinux]  [k] __sk_mem_raise_allocated
     2.10%   +0.37%  [kernel.vmlinux]  [k] __rcu_read_unlock
     2.04%   +0.67%  [mlx5_core]       [k] mlx5e_alloc_rx_wqe
     1.86%   +0.40%  [kernel.vmlinux]  [k] inet_gro_receive
     1.57%   +0.11%  [kernel.vmlinux]  [k] kmem_cache_alloc
     1.53%   +0.28%  [kernel.vmlinux]  [k] _raw_read_lock
     1.53%   +0.25%  [kernel.vmlinux]  [k] dev_gro_receive
     1.38%   -0.18%  [kernel.vmlinux]  [k] udp_gro_receive
     1.19%   +0.37%  [kernel.vmlinux]  [k] __rcu_read_lock
     1.14%   +0.31%  [kernel.vmlinux]  [k] _raw_read_unlock
     1.14%   +0.12%  [kernel.vmlinux]  [k] ip_rcv_finish
     1.13%   +0.20%  [kernel.vmlinux]  [k] __udp4_lib_lookup
     1.05%   +0.16%  [kernel.vmlinux]  [k] ktime_get_with_offset
     0.94%   +0.38%  [kernel.vmlinux]  [k] ip_local_deliver_finish
     0.91%   +0.22%  [kernel.vmlinux]  [k] do_csum
     0.86%   -0.04%  [kernel.vmlinux]  [k] ipv4_pktinfo_prepare
     0.84%   +0.05%  [kernel.vmlinux]  [k] sk_filter_trim_cap
     0.84%   +0.20%  [kernel.vmlinux]  [k] ip_local_deliver
     0.84%   +0.19%  [kernel.vmlinux]  [k] udp4_gro_receive

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
