netdev - Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

Message-ID: <20181101102213.2fa2643d@redhat.com>
Date:   Thu, 1 Nov 2018 10:22:13 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Paweł Staszewski <pstaszewski@...are.pl>
Cc:     brouer@...hat.com, Eric Dumazet <eric.dumazet@...il.com>,
        netdev <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Yoel Caspersen <yoel@...knet.dk>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Aaron Lu <aaron.lu@...el.com>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal
 users traffic

On Wed, 31 Oct 2018 23:20:01 +0100
Paweł Staszewski <pstaszewski@...are.pl> wrote:

> On 31.10.2018 at 23:09, Eric Dumazet wrote:
> >
> > On 10/31/2018 02:57 PM, Paweł Staszewski wrote:  
> >> Hi
> >>
> >> So maybe someone will be interested in how the Linux kernel handles
> >> normal traffic (not pktgen :) )

Paweł, is this live production traffic?

I know Yoel (Cc) is very interested to know the real-life limitations of
Linux as a router, especially with VLANs like you use.


> >>
> >> Server HW configuration:
> >>
> >> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> >>
> >> NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 PCIe 8GT)
> >>
> >>
> >> Server software:
> >>
> >> FRR - as routing daemon
> >>
> >> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS bound to local NUMA node)
> >>
> >> enp175s0f1 (100G) - 343 vlans to clients (28 RSS bound to local NUMA node)
> >>
> >>
> >> Maximum traffic that server can handle:
> >>
> >> Bandwidth
> >>
> >>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>    input: /proc/net/dev type: rate
> >>    \         iface                   Rx                   Tx                Total
> >> ==============================================================================
> >>         enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
> >>         enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
> >> ------------------------------------------------------------------------------
> >>              total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
> >>

Actually, a rather impressive number for a Linux router.

> >>
> >> Packets per second:
> >>
> >>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>    input: /proc/net/dev type: rate
> >>    -         iface                   Rx                   Tx                Total
> >> ==============================================================================
> >>         enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
> >>         enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
> >> ------------------------------------------------------------------------------
> >>              total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
> >>

Average packet size:
  (28.51*10^9/8)/5248589 =  678.99 bytes 
  (38.07*10^9/8)/3557944 = 1337.49 bytes
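
The same arithmetic as a small Python sketch, using the RX rates from
the bwm-ng output quoted above (the function name is only illustrative,
and the result is only as precise as the rounded Gb/s figures):

```python
def avg_packet_size(gbit_per_s: float, pkts_per_s: float) -> float:
    """Average packet size in bytes, given a bit rate and a packet rate."""
    bytes_per_s = gbit_per_s * 1e9 / 8  # bits/s -> bytes/s
    return bytes_per_s / pkts_per_s

# RX figures taken from the bwm-ng output quoted above
rx_rates = [
    ("enp175s0f1", 28.51, 5248589),
    ("enp175s0f0", 38.07, 3557944),
]

for iface, gbps, pps in rx_rates:
    print(f"{iface}: {avg_packet_size(gbps, pps):.1f} bytes")
```

So the client-facing port receives roughly 680 byte packets on average,
while the upstream-facing port receives roughly twice that.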


> >> After reaching those limits, NICs on the upstream side (more RX
> >> traffic) start to drop packets
> >>
> >>
> >> I just don't understand why the server can't handle more bandwidth
> >> (~40Gbit/s is the limit where all CPUs are at 100% util) - where
> >> pps on the RX side are increasing.
> >>
> >> Was thinking that maybe I reached some PCIe x16 limit - but x16 8GT
> >> is 126Gbit - and also when testing with pktgen I can reach more bw
> >> and pps (like 4x more compared to normal internet traffic)
> >>
> >> And wondering if there is something that can be improved here.
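
For reference, the 126Gbit figure for PCIe x16 at 8GT/s follows from
the 128b/130b line encoding used at that signalling rate.  A quick
Python check (raw link rate only; protocol overhead such as TLP
headers is not modelled here, so usable throughput will be lower; the
function name is only illustrative):

```python
def pcie_raw_gbit(lanes: int, gt_per_s: float,
                  payload_bits: int = 128, line_bits: int = 130) -> float:
    """Raw PCIe link bandwidth in Gbit/s after line encoding."""
    return lanes * gt_per_s * payload_bits / line_bits

print(f"{pcie_raw_gbit(16, 8.0):.2f} Gbit/s")  # 126.03 Gbit/s
```
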
> >>
> >>
> >>
> >> Some more information / counters / stats and perf top below:
> >>
> >> Perf top flame graph:
> >>
> >> https://uploadfiles.io/7zo6u

Thanks a lot for the flame graph!

> >>
> >> System configuration(long):
> >>
> >>
> >> cat /sys/devices/system/node/node1/cpulist
> >> 14-27,42-55
> >> cat /sys/class/net/enp175s0f0/device/numa_node
> >> 1
> >> cat /sys/class/net/enp175s0f1/device/numa_node
> >> 1
> >>

Hint: grep can give you nicer output than cat:

$ grep -H . /sys/class/net/*/device/numa_node

> >>
> >>
> >>
> >>
> >> ip -s -d link ls dev enp175s0f0
> >> 6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
> >>      link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
> >>      RX: bytes  packets  errors  dropped overrun mcast
> >>      184142375840858 141347715974 2       2806325 0       85050528
> >>      TX: bytes  packets  errors  dropped carrier collsns
> >>      99270697277430 172227994003 0       0       0       0
> >>
> >>   ip -s -d link ls dev enp175s0f1
> >> 7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
> >>      link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
> >>      RX: bytes  packets  errors  dropped overrun mcast
> >>      99686284170801 173507590134 61      669685  0       100304421
> >>      TX: bytes  packets  errors  dropped carrier collsns
> >>      184435107970545 142383178304 0       0       0       0
> >>

You have increased the default (1000) qlen to 8192, why?

What default qdisc do you run?  Looking through your very detailed main
email report (I do love the details you give!), I see
pfifo_fast_dequeue, so you run pfifo_fast and this 8192 qlen actually
has an effect.

I would like to know if, and how much, qdisc_dequeue bulking is
happening in this setup.  Can you run:

 perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets 

The perf-stat-hist tool is from Brendan Gregg's git tree:
 https://github.com/brendangregg/perf-tools
 https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist


> >> ./softnet.sh
> >> cpu      total    dropped   squeezed  collision        rps flow_limit
> >>
> >>
> >>
> >>
> >>     PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
> >> ------------------------------------------------------------------------------------------
> >>
> >>      26.78%  [kernel]       [k] queued_spin_lock_slowpath  
> >
> > This is highly suspect.
> >

I agree! 26.78% is spent in queued_spin_lock_slowpath.  Hint: if you
see _raw_spin_lock then it is likely not a contended lock, but if you
see queued_spin_lock_slowpath in a perf-report your workload is likely
in trouble.
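
As a rough way to apply that rule of thumb to saved perf output, here
is a minimal Python sketch.  It assumes perf-top style lines like the
one quoted above; the function name, the second sample line, and the
5% threshold are made up for illustration:

```python
import re

# Symbol that shows up when CPUs queue up waiting on a contended spinlock
CONTENDED_SYMBOL = "queued_spin_lock_slowpath"

# Matches lines such as:
#   26.78%  [kernel]       [k] queued_spin_lock_slowpath
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[(.)\]\s+(\S+)")

def contended_lock_percent(perf_text: str) -> float:
    """Sum the overhead percentages attributed to the slowpath symbol."""
    total = 0.0
    for line in perf_text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(4) == CONTENDED_SYMBOL:
            total += float(m.group(1))
    return total

sample = """
    26.78%  [kernel]       [k] queued_spin_lock_slowpath
     1.00%  [kernel]       [k] _raw_spin_lock
"""  # second line is a hypothetical entry, not from the report

pct = contended_lock_percent(sample)
if pct > 5.0:  # arbitrary threshold, chosen only for this sketch
    print(f"{pct:.2f}% in {CONTENDED_SYMBOL}: likely lock contention")
```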


> > A call graph (perf record -a -g sleep 1; perf report --stdio)
> > would tell what is going on.  
>
> perf report:
> https://ufile.io/rqp0h
> 

Thanks for the output (my 30" screen is just large enough to see the
full output).  Together with the flame-graph, it is clear that this
lock contention happens in the page allocator code.

Section copied out:

  mlx5e_poll_tx_cq
  |          
   --16.34%--napi_consume_skb
             |          
             |--12.65%--__free_pages_ok
             |          |          
             |           --11.86%--free_one_page
             |                     |          
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |          
             |                      --0.65%--_raw_spin_lock
             |          
             |--1.55%--page_frag_free
             |          
              --1.44%--skb_release_data


Let me explain what (I think) happens.  The mlx5 driver RX-page recycle
mechanism is not effective in this workload, and pages have to go
through the page allocator.  The lock contention happens during the
mlx5 DMA TX completion cycle, and the page allocator cannot keep up at
these speeds.

One solution is to extend the page allocator with a bulk free API.
(This has been on my TODO list for a long time, but I don't have a
micro-benchmark that tricks the driver page-recycle into failing).  It
should fit nicely, as I can see that kmem_cache_free_bulk() does get
activated (bulk freeing SKBs), which means that DMA TX completion does
have a bulk of packets.
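
To illustrate why a bulk free API would help, here is a toy Python
model.  This is not kernel code: ToyZone and free_pages_bulk() are
made-up names, and the batch size of 64 is an assumption.  It only
counts how often the single zone lock is taken when pages are returned
one by one versus in one batch:

```python
import threading

class ToyZone:
    """Toy stand-in for a memory zone guarded by a single lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.free_pages = []
        self.lock_acquisitions = 0

    def free_one_page(self, page):
        # One lock round trip per page
        with self.lock:
            self.lock_acquisitions += 1
            self.free_pages.append(page)

    def free_pages_bulk(self, pages):
        # Hypothetical bulk API: one lock round trip for the whole batch
        with self.lock:
            self.lock_acquisitions += 1
            self.free_pages.extend(pages)

batch = list(range(64))  # pages released in one TX completion poll (assumed)

per_page = ToyZone()
for page in batch:
    per_page.free_one_page(page)

bulk = ToyZone()
bulk.free_pages_bulk(batch)

print(per_page.lock_acquisitions, bulk.lock_acquisitions)  # 64 1
```

With many CPUs doing TX completion at the same time, every one of
those lock round trips is a chance to contend, which is what the bulk
variant would reduce.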

We can (and should) also improve the page recycle scheme in the driver.
After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
page_pool, and we will (attempt to) generalize this for both high-end
mlx5 and more low-end ARM64 boards (macchiatobin and espressobin).

The MM people are working in parallel to improve the performance of
order-0 page returns.  Thus, the explicit page bulk free API might
actually become less important.  I actually think Aaron (Cc'ed) has a
patchset he would like you to test, which removes the (zone->)lock
you hit in free_one_page().

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer