Message-ID: <7037c58d-d77d-bdd5-6c91-19cea3cbe539@itcare.pl>
Date: Sat, 10 Nov 2018 20:49:56 +0100
From: Paweł Staszewski <pstaszewski@...are.pl>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Saeed Mahameed <saeedm@...lanox.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users
traffic
On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
> On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski <pstaszewski@...are.pl> wrote:
>
>> On 08.11.2018 at 20:12, Paweł Staszewski wrote:
>>> CPU load is lower than with the ConnectX-4 - but it looks like the
>>> bandwidth limit is the same :)
>>> But also after reaching 60Gbit/60Gbit:
>>>
>>> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>> input: /proc/net/dev type: rate
>>> - iface Rx Tx Total
>>> ==========================================================================
>>>
>>> enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s
>>> enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s
>>> --------------------------------------------------------------------------
>>>
>>> total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s
>> Today it reached 65/65 Gbit/s.
>>
>> But starting from 60Gbit/s RX / 60Gbit/s TX the NICs start to drop packets
>> (with 50% CPU on all 28 cores) - so there is still CPU power to spare :).
> This is weird!
>
> How do you see / measure these drops?
A simple ICMP test like ping -i 0.1.
I ping the management IP address on a VLAN that is attached to one NIC
(the side that is more stressed on RX), and a second ICMP test goes
through the router to a host behind it (forwarded traffic).
Both measurements show the same loss ratio of 0.1 to 0.5% after reaching
~45Gbit/s on the RX side - depending on how hard the RX side is pushed,
drops vary between 0.1 and 0.5 - even 0.6% :)
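Roughly what the two tests look like (the addresses below are just
placeholders for the management IP and the host behind the router):

  # 1) ping the management IP on the VLAN terminated on the stressed NIC
  ping -i 0.1 10.0.0.1
  # 2) ping a host behind the router, so packets are forwarded through it
  ping -i 0.1 192.168.1.100
  # loss ratio is read from ping's summary line after a few thousand probes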
>
>
>> So I checked other stats.
>> softnet_stat shows on average ~1k squeezed per sec:
> Is the output below the raw counters, not per sec?
>
> It would be valuable to see the per sec stats instead...
> I use this tool:
> https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
>
>> cpu total dropped squeezed collision rps flow_limit
>> 0 18554 0 1 0 0 0
>> 1 16728 0 1 0 0 0
>> 2 18033 0 1 0 0 0
>> 3 17757 0 1 0 0 0
>> 4 18861 0 0 0 0 0
>> 5 0 0 1 0 0 0
>> 6 2 0 1 0 0 0
>> 7 0 0 1 0 0 0
>> 8 0 0 0 0 0 0
>> 9 0 0 1 0 0 0
>> 10 0 0 0 0 0 0
>> 11 0 0 1 0 0 0
>> 12 50 0 1 0 0 0
>> 13 257 0 0 0 0 0
>> 14 3629115363 0 3353259 0 0 0
>> 15 255167835 0 3138271 0 0 0
>> 16 4240101961 0 3036130 0 0 0
>> 17 599810018 0 3072169 0 0 0
>> 18 432796524 0 3034191 0 0 0
>> 19 41803906 0 3037405 0 0 0
>> 20 900382666 0 3112294 0 0 0
>> 21 620926085 0 3086009 0 0 0
>> 22 41861198 0 3023142 0 0 0
>> 23 4090425574 0 2990412 0 0 0
>> 24 4264870218 0 3010272 0 0 0
>> 25 141401811 0 3027153 0 0 0
>> 26 104155188 0 3051251 0 0 0
>> 27 4261258691 0 3039765 0 0 0
>> 28 4 0 1 0 0 0
>> 29 4 0 0 0 0 0
>> 30 0 0 1 0 0 0
>> 31 0 0 0 0 0 0
>> 32 3 0 1 0 0 0
>> 33 1 0 1 0 0 0
>> 34 0 0 1 0 0 0
>> 35 0 0 0 0 0 0
>> 36 0 0 1 0 0 0
>> 37 0 0 1 0 0 0
>> 38 0 0 1 0 0 0
>> 39 0 0 1 0 0 0
>> 40 0 0 0 0 0 0
>> 41 0 0 1 0 0 0
>> 42 299758202 0 3139693 0 0 0
>> 43 4254727979 0 3103577 0 0 0
>> 44 1959555543 0 2554885 0 0 0
>> 45 1675702723 0 2513481 0 0 0
>> 46 1908435503 0 2519698 0 0 0
>> 47 1877799710 0 2537768 0 0 0
>> 48 2384274076 0 2584673 0 0 0
>> 49 2598104878 0 2593616 0 0 0
>> 50 1897566829 0 2530857 0 0 0
>> 51 1712741629 0 2489089 0 0 0
>> 52 1704033648 0 2495892 0 0 0
>> 53 1636781820 0 2499783 0 0 0
>> 54 1861997734 0 2541060 0 0 0
>> 55 2113521616 0 2555673 0 0 0
>>
>>
>> So I raised the netdev backlog and budget to really high values:
>> 524288 for netdev_budget and the same for the backlog
> Does it affect the squeezed counters?
A little - but not much.
After changing the budget from 65536 to 524k, the number of squeezed
counters across all CPUs dropped from about 1.5k per second to 0.9-1k per
second - but increasing it further, above 524k, changes nothing: still
0.9 to 1k/s squeezed.
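For reference, a rough way to get per-second squeezed counts straight from
/proc/net/softnet_stat (assuming GNU awk for the hex conversion; the 3rd
column is time_squeeze):

  awk '{print strtonum("0x"$3)}' /proc/net/softnet_stat > /tmp/sq1
  sleep 1
  awk '{print strtonum("0x"$3)}' /proc/net/softnet_stat > /tmp/sq2
  # difference of the two snapshots, summed over all CPUs
  paste /tmp/sq1 /tmp/sq2 | awk '{s += $2 - $1} END {print "squeezed/sec:", s}'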
>
> Notice, this (crazy) huge netdev_budget limit will also be limited
> by /proc/sys/net/core/netdev_budget_usecs.
Yes, I changed that too, to 1000 / 2000 / 3000 / 4000 - not much difference
in squeezed; I can't really even see a difference.
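The knobs I am tuning are just these sysctls (values are the ones from the
tests above, not a recommendation):

  # max packets processed per NET_RX softirq round, and its time bound
  sysctl -w net.core.netdev_budget=524288
  sysctl -w net.core.netdev_budget_usecs=2000
  # input backlog queue length (used by the non-NAPI / RPS path)
  sysctl -w net.core.netdev_max_backlog=524288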
>
>> This raised softirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX
> Hmmm, this could indicate that not enough NAPI bulking is occurring.
>
> I have a BPF tool that can give you some insight into NAPI bulking and
> softirq idle/kthread starting. It's called 'napi_monitor'; could you try to
> run it, so we can try to understand this? You can find the tool here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_user.c
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_kern.c
Yes, I will try it.
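In the meantime, a quick way to watch the NET_RX/NET_TX rates mentioned above
(just sampling /proc/softirqs, not the napi_monitor tool):

  # sum NET_RX/NET_TX counts over all CPUs, twice, 1s apart
  grep -E 'NET_(RX|TX)' /proc/softirqs | awk '{s=0; for(i=2;i<=NF;i++) s+=$i; print $1, s}'
  sleep 1
  grep -E 'NET_(RX|TX)' /proc/softirqs | awk '{s=0; for(i=2;i<=NF;i++) s+=$i; print $1, s}'
  # subtract the two readings per row to get softirqs/sec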
>
>> But after these changes I have fewer packet drops.
>>
>>
>> Below perf top from max traffic reached:
>> PerfTop: 72230 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
>> ------------------------------------------------------------------------------------------
>>
>> 12.62% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
>> 8.44% [kernel] [k] mlx5e_sq_xmit
>> 6.69% [kernel] [k] build_skb
>> 5.21% [kernel] [k] fib_table_lookup
>> 3.54% [kernel] [k] memcpy_erms
>> 3.20% [kernel] [k] mlx5e_poll_rx_cq
>> 2.25% [kernel] [k] vlan_do_receive
>> 2.20% [kernel] [k] mlx5e_post_rx_mpwqes
>> 2.02% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
>> 1.95% [kernel] [k] __dev_queue_xmit
>> 1.83% [kernel] [k] dev_gro_receive
>> 1.79% [kernel] [k] tcp_gro_receive
>> 1.73% [kernel] [k] ip_finish_output2
>> 1.63% [kernel] [k] mlx5e_poll_tx_cq
>> 1.49% [kernel] [k] ipt_do_table
>> 1.38% [kernel] [k] inet_gro_receive
>> 1.31% [kernel] [k] __netif_receive_skb_core
>> 1.30% [kernel] [k] _raw_spin_lock
>> 1.28% [kernel] [k] mlx5_eq_int
>> 1.24% [kernel] [k] irq_entries_start
>> 1.19% [kernel] [k] __build_skb
>> 1.15% [kernel] [k] swiotlb_map_page
>> 1.02% [kernel] [k] vlan_dev_hard_start_xmit
>> 0.94% [kernel] [k] pfifo_fast_dequeue
>> 0.92% [kernel] [k] ip_route_input_rcu
>> 0.86% [kernel] [k] kmem_cache_alloc
>> 0.80% [kernel] [k] mlx5e_xmit
>> 0.79% [kernel] [k] dev_hard_start_xmit
>> 0.78% [kernel] [k] _raw_spin_lock_irqsave
>> 0.74% [kernel] [k] ip_forward
>> 0.72% [kernel] [k] tasklet_action_common.isra.21
>> 0.68% [kernel] [k] pfifo_fast_enqueue
>> 0.67% [kernel] [k] netif_skb_features
>> 0.66% [kernel] [k] skb_segment
>> 0.60% [kernel] [k] skb_gro_receive
>> 0.56% [kernel] [k] validate_xmit_skb.isra.142
>> 0.53% [kernel] [k] skb_release_data
>> 0.51% [kernel] [k] mlx5e_page_release
>> 0.51% [kernel] [k] ip_rcv_core.isra.20.constprop.25
>> 0.51% [kernel] [k] __qdisc_run
>> 0.50% [kernel] [k] tcp4_gro_receive
>> 0.49% [kernel] [k] page_frag_free
>> 0.46% [kernel] [k] kmem_cache_free_bulk
>> 0.43% [kernel] [k] kmem_cache_free
>> 0.42% [kernel] [k] try_to_wake_up
>> 0.39% [kernel] [k] _raw_spin_lock_irq
>> 0.39% [kernel] [k] find_busiest_group
>> 0.37% [kernel] [k] __memcpy
>>
>>
>>
>> Remember, those tests are now on two separate ConnectX-5 NICs connected to
>> two separate PCIe x16 gen 3.0 slots.
>
> That is strange... I still suspect some HW NIC issue. Can you provide
> ethtool stats info via this tool:
>
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>
> $ ethtool_stats.pl --dev enp175s0 --dev enp216s0
>
> The tool removes zero-stats counters and reports per-sec stats. That makes
> it easier to spot what is relevant for the given workload.
Yes, Mellanox just has too many counters that are always 0 in my case :)
Will try this also.
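Until then, a quick way to drop the all-zero counters from the raw output
(plain ethtool -S, so absolute counters rather than per-sec rates):

  ethtool -S enp175s0 | grep -v ': 0$'
  ethtool -S enp216s0 | grep -v ': 0$'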
>
> Can you give the output from:
> $ ethtool --show-priv-flags DEVICE
>
> I want you to experiment with:
ethtool --show-priv-flags enp175s0
Private flags for enp175s0:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : off
rx_striding_rq : on
rx_no_csum_complete: off
>
> ethtool --set-priv-flags DEVICE rx_striding_rq off
OK, I will first check on a test server whether this resets my interface,
and that it does not produce a kernel panic :)
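Something like this on the test box first (same commands as above, just with
a check and a revert):

  ethtool --set-priv-flags enp175s0 rx_striding_rq off
  ethtool --show-priv-flags enp175s0   # confirm the flag changed and link is still up
  # ... run traffic, watch bwm-ng / softnet_stat ...
  ethtool --set-priv-flags enp175s0 rx_striding_rq on   # revert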
>
> I think you have already played with 'rx_cqe_compress', right?
Yes - and compression increases the number of IRQs but does not do much for
bandwidth; same limit of 60-64Gbit/s total RX+TX on one 100G port.
And what is weird is that the limit is symmetric overall: if, for example,
the 100G port is receiving 42G of traffic and transmitting 20G, and I then
flood the RX side with pktgen or other traffic (for example ICMP) at
1/2/3/4/5G, the receiving side increases by 1/2/3/4/5Gbit but the
transmitting side drops by the same amount.
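For reference, the compression experiment is the same kind of priv-flag
toggle as above, watched with bwm-ng (a sketch, interface names as above):

  ethtool --set-priv-flags enp175s0 rx_cqe_compress on
  bwm-ng -u bits        # watch RX/TX on enp175s0 / enp216s0
  ethtool --set-priv-flags enp175s0 rx_cqe_compress off   # back to the default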