Date:   Sat, 10 Nov 2018 20:34:09 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Paweł Staszewski <pstaszewski@...are.pl>
Cc:     Saeed Mahameed <saeedm@...lanox.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        brouer@...hat.com
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal
 users traffic

On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski <pstaszewski@...are.pl> wrote:

> On 08.11.2018 at 20:12, Paweł Staszewski wrote:
> > CPU load is lower than for the ConnectX-4 - but it looks like the
> > bandwidth limit is the same :)
> > Also after reaching 60Gbit/60Gbit:
> >
> >  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >   input: /proc/net/dev type: rate
> >   iface                          Rx                   Tx          Total
> > ==========================================================================
> >          enp175s0:          45.09 Gb/s           15.09 Gb/s     60.18 Gb/s
> >          enp216s0:          15.14 Gb/s           45.19 Gb/s     60.33 Gb/s
> > --------------------------------------------------------------------------
> >             total:          60.45 Gb/s           60.48 Gb/s    120.93 Gb/s
> 
> Today I reached 65/65 Gbit/s.
> 
> But starting from 60 Gbit/s RX / 60 Gbit/s TX the NICs start to drop packets
> (with 50% CPU on all 28 cores) - so there is still CPU power to spare :).

This is weird!

How do you see / measure these drops?
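
One quick way is to sample the standard netdev drop counters in sysfs and
look at per-second deltas.  A minimal sketch in Python (it uses your two
interface names from above; adjust the stat list as needed):

  #!/usr/bin/env python3
  # Print per-second deltas of the sysfs drop counters for the two NICs.
  import time

  DEVS = ["enp175s0", "enp216s0"]
  STATS = ["rx_dropped", "tx_dropped", "rx_missed_errors"]

  def read(dev, stat):
      with open(f"/sys/class/net/{dev}/statistics/{stat}") as f:
          return int(f.read())

  prev = {(d, s): read(d, s) for d in DEVS for s in STATS}
  while True:
      time.sleep(1)
      for d in DEVS:
          for s in STATS:
              cur = read(d, s)
              if cur != prev[(d, s)]:
                  print(f"{d} {s}: {cur - prev[(d, s)]}/sec")
              prev[(d, s)] = cur

If the drops only show up in NIC-specific counters, the standard sysfs
counters above may stay flat, so the ethtool stats asked for below would
still be needed.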


> So I checked other stats.
> softnet_stat shows an average of 1k squeezed per sec:

Is the output below the raw counters, not per sec?

It would be valuable to see the per sec stats instead...
I use this tool:
 https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
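
If running the Perl script is inconvenient, the same per-sec view of the
squeezed column can be approximated with a few lines of Python (a rough
sketch, not the tool itself; /proc/net/softnet_stat fields are hex and the
3rd field is time_squeeze):

  #!/usr/bin/env python3
  # Print per-second deltas of the time_squeeze column of
  # /proc/net/softnet_stat (one line per CPU, hex fields).
  import time

  def squeezed():
      with open("/proc/net/softnet_stat") as f:
          return [int(line.split()[2], 16) for line in f]

  prev = squeezed()
  while True:
      time.sleep(1)
      cur = squeezed()
      deltas = [c - p for c, p in zip(cur, prev)]
      prev = cur
      print("squeezed/sec total:", sum(deltas),
            "nonzero per-cpu:", [d for d in deltas if d])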

> cpu      total    dropped   squeezed  collision        rps flow_limit
>    0      18554          0          1          0          0 0
>    1      16728          0          1          0          0 0
>    2      18033          0          1          0          0 0
>    3      17757          0          1          0          0 0
>    4      18861          0          0          0          0 0
>    5          0          0          1          0          0 0
>    6          2          0          1          0          0 0
>    7          0          0          1          0          0 0
>    8          0          0          0          0          0 0
>    9          0          0          1          0          0 0
>   10          0          0          0          0          0 0
>   11          0          0          1          0          0 0
>   12         50          0          1          0          0 0
>   13        257          0          0          0          0 0
>   14 3629115363          0    3353259          0          0 0
>   15  255167835          0    3138271          0          0 0
>   16 4240101961          0    3036130          0          0 0
>   17  599810018          0    3072169          0          0 0
>   18  432796524          0    3034191          0          0 0
>   19   41803906          0    3037405          0          0 0
>   20  900382666          0    3112294          0          0 0
>   21  620926085          0    3086009          0          0 0
>   22   41861198          0    3023142          0          0 0
>   23 4090425574          0    2990412          0          0 0
>   24 4264870218          0    3010272          0          0 0
>   25  141401811          0    3027153          0          0 0
>   26  104155188          0    3051251          0          0 0
>   27 4261258691          0    3039765          0          0 0
>   28          4          0          1          0          0 0
>   29          4          0          0          0          0 0
>   30          0          0          1          0          0 0
>   31          0          0          0          0          0 0
>   32          3          0          1          0          0 0
>   33          1          0          1          0          0 0
>   34          0          0          1          0          0 0
>   35          0          0          0          0          0 0
>   36          0          0          1          0          0 0
>   37          0          0          1          0          0 0
>   38          0          0          1          0          0 0
>   39          0          0          1          0          0 0
>   40          0          0          0          0          0 0
>   41          0          0          1          0          0 0
>   42  299758202          0    3139693          0          0 0
>   43 4254727979          0    3103577          0          0 0
>   44 1959555543          0    2554885          0          0 0
>   45 1675702723          0    2513481          0          0 0
>   46 1908435503          0    2519698          0          0 0
>   47 1877799710          0    2537768          0          0 0
>   48 2384274076          0    2584673          0          0 0
>   49 2598104878          0    2593616          0          0 0
>   50 1897566829          0    2530857          0          0 0
>   51 1712741629          0    2489089          0          0 0
>   52 1704033648          0    2495892          0          0 0
>   53 1636781820          0    2499783          0          0 0
>   54 1861997734          0    2541060          0          0 0
>   55 2113521616          0    2555673          0          0 0
> 
> 
> So I raised the netdev backlog and budget to really high values:
> 524288 for netdev_budget and the same for the backlog.

Does it affect the squeezed counters?

Notice that this (crazy) huge netdev_budget will also be capped
by /proc/sys/net/core/netdev_budget_usecs.
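
Both knobs live under /proc/sys/net/core, and as far as I recall hitting
either one (packet budget or time budget) ends the net_rx_action() run and
is counted as a "squeeze".  A tiny sketch to dump both at once:

  #!/usr/bin/env python3
  # Show the two limits that bound a single net_rx_action() run.
  for knob in ("netdev_budget", "netdev_budget_usecs"):
      with open(f"/proc/sys/net/core/{knob}") as f:
          print(knob, "=", f.read().strip())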

> This raised softirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX.

Hmmm, this could indicate that not enough NAPI bulking is occurring.

I have a BPF tool called 'napi_monitor' that can give some insight into
NAPI bulking and softirq idle/kthread starting.  Could you try to run it,
so we can try to understand this?  You can find the tool here:

 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_user.c
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/napi_monitor_kern.c
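
Independently of the BPF tool, the NET_RX/NET_TX softirq rate itself can be
double-checked straight from /proc/softirqs.  A small sketch (sums over all
CPUs and prints per-second deltas):

  #!/usr/bin/env python3
  # Per-second NET_RX / NET_TX softirq rates, summed over all CPUs.
  import time

  def counts():
      out = {}
      with open("/proc/softirqs") as f:
          for line in f:
              parts = line.split()
              if parts and parts[0].rstrip(":") in ("NET_RX", "NET_TX"):
                  out[parts[0].rstrip(":")] = sum(int(x) for x in parts[1:])
      return out

  prev = counts()
  while True:
      time.sleep(1)
      cur = counts()
      print({k: f"{cur[k] - prev[k]}/sec" for k in cur})
      prev = cur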
 
> But after these changes I have fewer packet drops.
> 
> 
> Below is perf top at the max traffic reached:
>     PerfTop:   72230 irqs/sec  kernel:99.4%  exact: 0.0%
>               [4000Hz cycles], (all, 56 CPUs)
> ------------------------------------------------------------------------------------------
> 
>      12.62%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
>       8.44%  [kernel]       [k] mlx5e_sq_xmit
>       6.69%  [kernel]       [k] build_skb
>       5.21%  [kernel]       [k] fib_table_lookup
>       3.54%  [kernel]       [k] memcpy_erms
>       3.20%  [kernel]       [k] mlx5e_poll_rx_cq
>       2.25%  [kernel]       [k] vlan_do_receive
>       2.20%  [kernel]       [k] mlx5e_post_rx_mpwqes
>       2.02%  [kernel]       [k] mlx5e_handle_rx_cqe_mpwrq
>       1.95%  [kernel]       [k] __dev_queue_xmit
>       1.83%  [kernel]       [k] dev_gro_receive
>       1.79%  [kernel]       [k] tcp_gro_receive
>       1.73%  [kernel]       [k] ip_finish_output2
>       1.63%  [kernel]       [k] mlx5e_poll_tx_cq
>       1.49%  [kernel]       [k] ipt_do_table
>       1.38%  [kernel]       [k] inet_gro_receive
>       1.31%  [kernel]       [k] __netif_receive_skb_core
>       1.30%  [kernel]       [k] _raw_spin_lock
>       1.28%  [kernel]       [k] mlx5_eq_int
>       1.24%  [kernel]       [k] irq_entries_start
>       1.19%  [kernel]       [k] __build_skb
>       1.15%  [kernel]       [k] swiotlb_map_page
>       1.02%  [kernel]       [k] vlan_dev_hard_start_xmit
>       0.94%  [kernel]       [k] pfifo_fast_dequeue
>       0.92%  [kernel]       [k] ip_route_input_rcu
>       0.86%  [kernel]       [k] kmem_cache_alloc
>       0.80%  [kernel]       [k] mlx5e_xmit
>       0.79%  [kernel]       [k] dev_hard_start_xmit
>       0.78%  [kernel]       [k] _raw_spin_lock_irqsave
>       0.74%  [kernel]       [k] ip_forward
>       0.72%  [kernel]       [k] tasklet_action_common.isra.21
>       0.68%  [kernel]       [k] pfifo_fast_enqueue
>       0.67%  [kernel]       [k] netif_skb_features
>       0.66%  [kernel]       [k] skb_segment
>       0.60%  [kernel]       [k] skb_gro_receive
>       0.56%  [kernel]       [k] validate_xmit_skb.isra.142
>       0.53%  [kernel]       [k] skb_release_data
>       0.51%  [kernel]       [k] mlx5e_page_release
>       0.51%  [kernel]       [k] ip_rcv_core.isra.20.constprop.25
>       0.51%  [kernel]       [k] __qdisc_run
>       0.50%  [kernel]       [k] tcp4_gro_receive
>       0.49%  [kernel]       [k] page_frag_free
>       0.46%  [kernel]       [k] kmem_cache_free_bulk
>       0.43%  [kernel]       [k] kmem_cache_free
>       0.42%  [kernel]       [k] try_to_wake_up
>       0.39%  [kernel]       [k] _raw_spin_lock_irq
>       0.39%  [kernel]       [k] find_busiest_group
>       0.37%  [kernel]       [k] __memcpy
> 
> 
> 
> Remember those tests are now on two separate ConnectX-5 NICs, connected to
> two separate PCIe x16 gen 3.0 slots.
 
That is strange... I still suspect some HW NIC issue.  Can you provide
ethtool stats info via this tool:

 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

$ ethtool_stats.pl --dev enp175s0 --dev enp216s0

The tool removes zero-stat counters and reports per-sec stats.  That makes
it easier to spot what is relevant for the given workload.
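
If the Perl dependencies are a problem, here is a rough stand-in (just a
sketch, not the same tool): it diffs 'ethtool -S <dev>' once per second and
only prints counters that actually moved.

  #!/usr/bin/env python3
  # Per-second deltas of 'ethtool -S' counters, skipping unchanged ones.
  import subprocess, sys, time

  def stats(dev):
      out = subprocess.run(["ethtool", "-S", dev],
                           capture_output=True, text=True).stdout
      res = {}
      for line in out.splitlines():
          name, _, val = line.rpartition(":")
          try:
              res[name.strip()] = int(val)
          except ValueError:
              pass
      return res

  dev = sys.argv[1] if len(sys.argv) > 1 else "enp175s0"
  prev = stats(dev)
  while True:
      time.sleep(1)
      cur = stats(dev)
      for name, val in sorted(cur.items()):
          d = val - prev.get(name, val)
          if d:
              print(f"{dev:>10} {name}: {d}/sec")
      prev = cur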

Can you give the output from:
 $ ethtool --show-priv-flags DEVICE

I want you to experiment with:

 ethtool --set-priv-flags DEVICE rx_striding_rq off 

I think you have already played with 'rx_cqe_compress', right?
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
