Message-ID: <20181110203409.482f39ec@redhat.com>
Date: Sat, 10 Nov 2018 20:34:09 +0100
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Paweł Staszewski <pstaszewski@...are.pl>
Cc: Saeed Mahameed <saeedm@...lanox.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
brouer@...hat.com
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal
users traffic
On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski <pstaszewski@...are.pl> wrote:
> On 08.11.2018 at 20:12, Paweł Staszewski wrote:
> > CPU load is lower than with the ConnectX-4 - but it looks like the
> > bandwidth limit is the same :)
> > But also after reaching 60Gbit/60Gbit
> >
> > bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > input: /proc/net/dev type: rate
> >   iface                      Rx            Tx           Total
> > ==========================================================================
> >   enp175s0:           45.09 Gb/s    15.09 Gb/s     60.18 Gb/s
> >   enp216s0:           15.14 Gb/s    45.19 Gb/s     60.33 Gb/s
> > --------------------------------------------------------------------------
> >   total:              60.45 Gb/s    60.48 Gb/s    120.93 Gb/s
>
> Today it reached 65/65 Gbit/s
>
> But starting from 60 Gbit/s RX / 60 Gbit/s TX the NICs start to drop packets
> (with 50% CPU on all 28 cores) - so there is still CPU power left to use :).
This is weird!
How do you see / measure these drops?
> So checked other stats.
> softnet_stat shows an average of 1k squeezed per sec:
Is the output below the raw counters, not per-sec values?
It would be valuable to see the per-second stats instead...
I use this tool:
https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
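(For reference, a minimal Python sketch of the same per-second delta idea,
not the linked Perl tool itself; it assumes the usual /proc/net/softnet_stat
column layout where col 0 = processed, col 1 = dropped, col 2 = time_squeeze:)

  #!/usr/bin/env python3
  # Minimal sketch: per-second deltas of /proc/net/softnet_stat.
  # Assumes the usual column layout: col0=processed, col1=dropped,
  # col2=time_squeeze ("squeezed"), all values in hex, one line per CPU.
  import time

  def read_softnet():
      rows = []
      with open("/proc/net/softnet_stat") as f:
          for line in f:
              cols = [int(x, 16) for x in line.split()]
              rows.append(cols[:3])   # processed, dropped, time_squeeze
      return rows

  prev = read_softnet()
  while True:
      time.sleep(1)
      cur = read_softnet()
      print("cpu  processed/s  dropped/s  squeezed/s")
      for cpu, (p, c) in enumerate(zip(prev, cur)):
          d = [b - a for a, b in zip(p, c)]
          if any(d):
              print("%3d  %11d  %9d  %10d" % (cpu, d[0], d[1], d[2]))
      prev = cur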
> cpu total dropped squeezed collision rps flow_limit
> 0 18554 0 1 0 0 0
> 1 16728 0 1 0 0 0
> 2 18033 0 1 0 0 0
> 3 17757 0 1 0 0 0
> 4 18861 0 0 0 0 0
> 5 0 0 1 0 0 0
> 6 2 0 1 0 0 0
> 7 0 0 1 0 0 0
> 8 0 0 0 0 0 0
> 9 0 0 1 0 0 0
> 10 0 0 0 0 0 0
> 11 0 0 1 0 0 0
> 12 50 0 1 0 0 0
> 13 257 0 0 0 0 0
> 14 3629115363 0 3353259 0 0 0
> 15 255167835 0 3138271 0 0 0
> 16 4240101961 0 3036130 0 0 0
> 17 599810018 0 3072169 0 0 0
> 18 432796524 0 3034191 0 0 0
> 19 41803906 0 3037405 0 0 0
> 20 900382666 0 3112294 0 0 0
> 21 620926085 0 3086009 0 0 0
> 22 41861198 0 3023142 0 0 0
> 23 4090425574 0 2990412 0 0 0
> 24 4264870218 0 3010272 0 0 0
> 25 141401811 0 3027153 0 0 0
> 26 104155188 0 3051251 0 0 0
> 27 4261258691 0 3039765 0 0 0
> 28 4 0 1 0 0 0
> 29 4 0 0 0 0 0
> 30 0 0 1 0 0 0
> 31 0 0 0 0 0 0
> 32 3 0 1 0 0 0
> 33 1 0 1 0 0 0
> 34 0 0 1 0 0 0
> 35 0 0 0 0 0 0
> 36 0 0 1 0 0 0
> 37 0 0 1 0 0 0
> 38 0 0 1 0 0 0
> 39 0 0 1 0 0 0
> 40 0 0 0 0 0 0
> 41 0 0 1 0 0 0
> 42 299758202 0 3139693 0 0 0
> 43 4254727979 0 3103577 0 0 0
> 44 1959555543 0 2554885 0 0 0
> 45 1675702723 0 2513481 0 0 0
> 46 1908435503 0 2519698 0 0 0
> 47 1877799710 0 2537768 0 0 0
> 48 2384274076 0 2584673 0 0 0
> 49 2598104878 0 2593616 0 0 0
> 50 1897566829 0 2530857 0 0 0
> 51 1712741629 0 2489089 0 0 0
> 52 1704033648 0 2495892 0 0 0
> 53 1636781820 0 2499783 0 0 0
> 54 1861997734 0 2541060 0 0 0
> 55 2113521616 0 2555673 0 0 0
>
>
> So I raised the netdev backlog and budget to really high values:
> 524288 for netdev_budget and the same for the backlog
Does it affect the squeezed counters?
Notice that this (crazy) huge netdev_budget will also be capped
by /proc/sys/net/core/netdev_budget_usecs.
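(A tiny sketch, assuming only the standard sysctl paths, that prints both
limits; net_rx_action stops early, and the "squeezed" counter is bumped,
when either the packet budget or the time budget runs out, whichever
happens first:)

  #!/usr/bin/env python3
  # Tiny helper: show the two limits that bound one net_rx_action run.
  # The softirq stops early (and softnet_stat "squeezed" is incremented)
  # when either the packet budget or the time budget is exhausted,
  # whichever happens first.
  def sysctl(name):
      with open("/proc/sys/net/core/" + name) as f:
          return int(f.read())

  budget = sysctl("netdev_budget")         # max packets per softirq run
  usecs  = sysctl("netdev_budget_usecs")   # max time per softirq run
  print("netdev_budget       = %d packets" % budget)
  print("netdev_budget_usecs = %d usecs" % usecs)
  print("-> raising netdev_budget alone does nothing once the "
        "%d usec time limit is hit first" % usecs)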
> This raised softirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX
Hmmm, this could indicate that not enough NAPI bulking is occurring.
I have a BPF tool that can give you some insight into NAPI bulking and
softirq idle/kthread starting, called 'napi_monitor'. Could you try to
run it, so we can try to understand this? You can find the tool here:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_user.c
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_kern.c
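(If building the BPF samples is a hassle, here is a rough Python sketch,
not the napi_monitor tool itself, that approximates the same bulking view
by sampling the napi:napi_poll tracepoint with perf and histogramming the
"work" field; the "work <N>" text in the perf-script output is an
assumption and may vary by kernel version:)

  #!/usr/bin/env python3
  # Rough sketch (NOT the napi_monitor BPF tool): estimate NAPI bulking
  # by sampling the napi:napi_poll tracepoint with perf for one second
  # and histogramming the "work" value (packets handled per poll).
  # Assumes perf is installed and that the perf-script line contains
  # "work <N>" (format may differ between kernel versions).
  import collections
  import re
  import subprocess

  subprocess.run(["perf", "record", "-q", "-e", "napi:napi_poll",
                  "-a", "--", "sleep", "1"], check=True)
  out = subprocess.run(["perf", "script"], check=True,
                       capture_output=True, text=True).stdout

  hist = collections.Counter()
  for m in re.finditer(r"work (\d+)", out):
      hist[int(m.group(1))] += 1

  for work in sorted(hist):
      print("%3d packets/poll : %6d polls" % (work, hist[work]))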
> But after these changes I see fewer packet drops.
>
>
> Below perf top from max traffic reached:
>    PerfTop:   72230 irqs/sec  kernel:99.4%  exact:  0.0%  [4000Hz cycles],  (all, 56 CPUs)
> ------------------------------------------------------------------------------------------
>
> 12.62% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
> 8.44% [kernel] [k] mlx5e_sq_xmit
> 6.69% [kernel] [k] build_skb
> 5.21% [kernel] [k] fib_table_lookup
> 3.54% [kernel] [k] memcpy_erms
> 3.20% [kernel] [k] mlx5e_poll_rx_cq
> 2.25% [kernel] [k] vlan_do_receive
> 2.20% [kernel] [k] mlx5e_post_rx_mpwqes
> 2.02% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
> 1.95% [kernel] [k] __dev_queue_xmit
> 1.83% [kernel] [k] dev_gro_receive
> 1.79% [kernel] [k] tcp_gro_receive
> 1.73% [kernel] [k] ip_finish_output2
> 1.63% [kernel] [k] mlx5e_poll_tx_cq
> 1.49% [kernel] [k] ipt_do_table
> 1.38% [kernel] [k] inet_gro_receive
> 1.31% [kernel] [k] __netif_receive_skb_core
> 1.30% [kernel] [k] _raw_spin_lock
> 1.28% [kernel] [k] mlx5_eq_int
> 1.24% [kernel] [k] irq_entries_start
> 1.19% [kernel] [k] __build_skb
> 1.15% [kernel] [k] swiotlb_map_page
> 1.02% [kernel] [k] vlan_dev_hard_start_xmit
> 0.94% [kernel] [k] pfifo_fast_dequeue
> 0.92% [kernel] [k] ip_route_input_rcu
> 0.86% [kernel] [k] kmem_cache_alloc
> 0.80% [kernel] [k] mlx5e_xmit
> 0.79% [kernel] [k] dev_hard_start_xmit
> 0.78% [kernel] [k] _raw_spin_lock_irqsave
> 0.74% [kernel] [k] ip_forward
> 0.72% [kernel] [k] tasklet_action_common.isra.21
> 0.68% [kernel] [k] pfifo_fast_enqueue
> 0.67% [kernel] [k] netif_skb_features
> 0.66% [kernel] [k] skb_segment
> 0.60% [kernel] [k] skb_gro_receive
> 0.56% [kernel] [k] validate_xmit_skb.isra.142
> 0.53% [kernel] [k] skb_release_data
> 0.51% [kernel] [k] mlx5e_page_release
> 0.51% [kernel] [k] ip_rcv_core.isra.20.constprop.25
> 0.51% [kernel] [k] __qdisc_run
> 0.50% [kernel] [k] tcp4_gro_receive
> 0.49% [kernel] [k] page_frag_free
> 0.46% [kernel] [k] kmem_cache_free_bulk
> 0.43% [kernel] [k] kmem_cache_free
> 0.42% [kernel] [k] try_to_wake_up
> 0.39% [kernel] [k] _raw_spin_lock_irq
> 0.39% [kernel] [k] find_busiest_group
> 0.37% [kernel] [k] __memcpy
>
>
>
> Remember, these tests are now on two separate ConnectX-5 NICs connected to
> two separate PCIe x16 gen 3.0 slots.
That is strange... I still suspect some HW NIC issue. Can you provide
ethtool stats info via this tool:
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
$ ethtool_stats.pl --dev enp175s0 --dev enp216s0
The tool removes zero-stat counters and reports per-second stats. That
makes it easier to spot what is relevant for the given workload.
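(A minimal Python sketch of the same per-second idea, not the Perl tool
itself: diff "ethtool -S <dev>" once per second and print only the counters
that actually moved:)

  #!/usr/bin/env python3
  # Minimal sketch of the per-second idea behind ethtool_stats.pl (not
  # the Perl tool itself): diff `ethtool -S <dev>` counters every second
  # and print only the ones that actually changed.
  import subprocess
  import sys
  import time

  def read_stats(dev):
      out = subprocess.run(["ethtool", "-S", dev], check=True,
                           capture_output=True, text=True).stdout
      stats = {}
      for line in out.splitlines():
          name, sep, val = line.partition(":")
          if sep:
              try:
                  stats[name.strip()] = int(val)
              except ValueError:
                  pass
      return stats

  dev = sys.argv[1] if len(sys.argv) > 1 else "enp175s0"
  prev = read_stats(dev)
  while True:
      time.sleep(1)
      cur = read_stats(dev)
      print("---", dev)
      for name, val in sorted(cur.items()):
          delta = val - prev.get(name, val)
          if delta:
              print("%-40s %15d /sec" % (name, delta))
      prev = cur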
Can you give the output from:
$ ethtool --show-priv-flag DEVICE
I want you to experiment with:
ethtool --set-priv-flags DEVICE rx_striding_rq off
I think you have already played with 'rx_cqe_compress', right?
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer