Message-ID: <92ff3f76-1ac2-84c9-1cd0-c4cb26e64074@itcare.pl>
Date: Thu, 1 Nov 2018 11:34:35 +0100
From: Paweł Staszewski <pstaszewski@...are.pl>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
netdev <netdev@...r.kernel.org>,
Tariq Toukan <tariqt@...lanox.com>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Yoel Caspersen <yoel@...knet.dk>,
Mel Gorman <mgorman@...hsingularity.net>,
Aaron Lu <aaron.lu@...el.com>
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal users
traffic
On 01.11.2018 at 10:22, Jesper Dangaard Brouer wrote:
> On Wed, 31 Oct 2018 23:20:01 +0100
> Paweł Staszewski <pstaszewski@...are.pl> wrote:
>
>> On 31.10.2018 at 23:09, Eric Dumazet wrote:
>>> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>>>> Hi
>>>>
>>>> So maybe someone will be interested in how the Linux kernel handles
>>>> normal traffic (not pktgen :) )
> Pawel is this live production traffic?
Yes, moved the server from the test lab to production to check (risking a little -
but this traffic can be switched to the backup router :) )
>
> I know Yoel (Cc) is very interested to know the real-life limitations of
> Linux as a router, especially with VLANs like you use.
So yes, this is real-life traffic, real users - normal mixed internet
traffic being forwarded (including DDoSes :) )
>
>
>>>> Server HW configuration:
>>>>
>>>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>>
>>>> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>>>>
>>>>
>>>> Server software:
>>>>
>>>> FRR - as routing daemon
>>>>
>>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to local NUMA node)
>>>>
>>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to local NUMA node)
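(To clarify how that binding was done here: each NIC's completion vectors were pinned to
the node-1 cpulist, roughly like the loop below - the grep pattern depends on how the
driver names its vectors in /proc/interrupts, mlx5_comp* in this case; per-queue
one-to-one pinning via the Mellanox set_irq_affinity scripts is the other common option.)

for irq in $(grep mlx5_comp /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    echo 14-27,42-55 > /proc/irq/$irq/smp_affinity_list
done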
>>>>
>>>>
>>>> Maximum traffic that server can handle:
>>>>
>>>> Bandwidth
>>>>
>>>> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>> input: /proc/net/dev type: rate
>>>> \ iface Rx Tx Total
>>>> ==============================================================================
>>>> enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s
>>>> enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s
>>>> ------------------------------------------------------------------------------
>>>> total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s
>>>>
> Actually a rather impressive number for a Linux router.
>
>>>> Packets per second:
>>>>
>>>> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>> input: /proc/net/dev type: rate
>>>> - iface Rx Tx Total
>>>> ==============================================================================
>>>> enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s
>>>> enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s
>>>> ------------------------------------------------------------------------------
>>>> total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s
>>>>
> Average packet size:
> (28.51*10^9/8)/5248589 = 678.99 bytes
> (38.07*10^9/8)/3557944 = 1337.49 bytes
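(As a quick cross-check, the same math as a shell one-liner, using the bwm-ng numbers
from the snapshot above:

awk 'BEGIN { printf "%.2f\n%.2f\n", (28.51e9/8)/5248589, (38.07e9/8)/3557944 }'

which agrees with the estimates above, up to rounding.)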
>
>
>>>> After reaching those limits, the nics on the upstream side (more RX
>>>> traffic) start to drop packets
>>>>
>>>>
>>>> I just don't understand why the server can't handle more bandwidth
>>>> (~40Gbit/s is the limit where all CPUs are at 100% util) - while pps on the
>>>> RX side keep increasing.
>>>>
>>>> Was thinking that maybe I reached some PCIe x16 limit - but x16 8GT
>>>> is 126Gbit - and also when testing with pktgen I can reach more bw
>>>> and pps (like 4x more compared to normal internet traffic)
>>>>
>>>> And wondering if there is something that can be improved here.
>>>>
>>>>
>>>>
>>>> Some more informations / counters / stats and perf top below:
>>>>
>>>> Perf top flame graph:
>>>>
>>>> https://uploadfiles.io/7zo6u
> Thanks a lot for the flame graph!
>
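(For anyone who wants to reproduce that kind of graph, the usual recipe with Brendan
Gregg's FlameGraph scripts is roughly:

perf record -a -g -- sleep 10
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flame.svg

with the two .pl scripts taken from https://github.com/brendangregg/FlameGraph )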
>>>> System configuration(long):
>>>>
>>>>
>>>> cat /sys/devices/system/node/node1/cpulist
>>>> 14-27,42-55
>>>> cat /sys/class/net/enp175s0f0/device/numa_node
>>>> 1
>>>> cat /sys/class/net/enp175s0f1/device/numa_node
>>>> 1
>>>>
> Hint: grep can give you nicer output than cat:
>
> $ grep -H . /sys/class/net/*/device/numa_node
Sure:
grep -H . /sys/class/net/*/device/numa_node
/sys/class/net/enp175s0f0/device/numa_node:1
/sys/class/net/enp175s0f1/device/numa_node:1
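And the same check for the node-local CPUs (sysfs should expose local_cpulist next to
numa_node for PCI devices):

grep -H . /sys/class/net/*/device/local_cpulist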
>
>>>>
>>>>
>>>>
>>>> ip -s -d link ls dev enp175s0f0
>>>> 6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>> link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>> RX: bytes packets errors dropped overrun mcast
>>>> 184142375840858 141347715974 2 2806325 0 85050528
>>>> TX: bytes packets errors dropped carrier collsns
>>>> 99270697277430 172227994003 0 0 0 0
>>>>
>>>> ip -s -d link ls dev enp175s0f1
>>>> 7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>> link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>> RX: bytes packets errors dropped overrun mcast
>>>> 99686284170801 173507590134 61 669685 0 100304421
>>>> TX: bytes packets errors dropped carrier collsns
>>>> 184435107970545 142383178304 0 0 0 0
>>>>
> You have increased the default (1000) qlen to 8192, why?
I was checking if a higher txqueuelen would change anything,
but there was no change for settings 1000, 4096, 8192.
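(Changed with plain ip link, something like: ip link set dev enp175s0f0 txqueuelen 4096
- and the same for the other port.)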
But yes, I do not use any traffic shaping there like hfsc/htb etc.
- just the default qdisc mq 0: root with pfifo_fast:
tc qdisc show dev enp175s0f1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :38 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :37 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :36 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
...
...
And the vlans are noqueue:
tc -s -d qdisc show dev vlan1521
qdisc noqueue 0: root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
What is weird is that no counters are increasing there, but there is traffic in/out on those
vlans:
ip -s -d link ls dev vlan1521
87: vlan1521@...175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0
vlan protocol 802.1Q id 1521 <REORDER_HDR> addrgenmode eui64
numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
562964218394 1639370761 0 0 0 0
TX: bytes packets errors dropped carrier collsns
1417648713052 618271312 0 0 0 0
>
> What default qdisc do you run?... looking through your very detailed main
> email report (I do love the details you give!). You run
> pfifo_fast_dequeue, thus this 8192 qlen is actually having an effect.
>
> I would like to know if and how much qdisc_dequeue bulking is happening
> in this setup? Can you run:
>
> perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
>
> The perf-stat-hist is from Brendan Gregg's git-tree:
> https://github.com/brendangregg/perf-tools
> https://github.com/brendangregg/perf-tools/blob/master/misc/perf-stat-hist
>
./perf-stat-hist -m 8192 -P2 qdisc:qdisc_dequeue packets
Tracing qdisc:qdisc_dequeue, power-of-2, max 8192, until Ctrl-C...
^C
Range : Count Distribution
-> -1 : 0 | |
0 -> 0 : 43768349 |######################################|
1 -> 1 : 43895249 |######################################|
2 -> 3 : 352 |# |
4 -> 7 : 228 |# |
8 -> 15 : 135 |# |
16 -> 31 : 73 |# |
32 -> 63 : 7 |# |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> : 0 | |
>>>> ./softnet.sh
>>>> cpu total dropped squeezed collision rps flow_limit
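(softnet.sh just dumps /proc/net/softnet_stat in decimal - roughly something like the
gawk below; field meaning taken from net/core/net-procfs.c, flow_limit being the last
column:

gawk '{ printf "cpu%-3d total=%d dropped=%d squeezed=%d collision=%d rps=%d flow_limit=%d\n",
        NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3),
        strtonum("0x"$9), strtonum("0x"$10), strtonum("0x"$11) }' /proc/net/softnet_stat
)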
>>>>
>>>>
>>>>
>>>>
>>>> PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
>>>> ------------------------------------------------------------------------------------------
>>>>
>>>> 26.78% [kernel] [k] queued_spin_lock_slowpath
>>> This is highly suspect.
>>>
> I agree! -- 26.78% spent in queued_spin_lock_slowpath. Hint: if you see
> _raw_spin_lock then it is likely not a contended lock, but if you see
> queued_spin_lock_slowpath in a perf-report your workload is likely in
> trouble.
>
>
>>> A call graph (perf record -a -g sleep 1; perf report --stdio)
>>> would tell what is going on.
>> perf report:
>> https://ufile.io/rqp0h
>>
> Thanks for the output (my 30" screen is just large enough to see the
> full output). Together with the flame-graph, it is clear that this
> lock happens in the page allocator code.
>
> Section copied out:
>
> mlx5e_poll_tx_cq
> |
> --16.34%--napi_consume_skb
> |
> |--12.65%--__free_pages_ok
> | |
> | --11.86%--free_one_page
> | |
> | |--10.10%--queued_spin_lock_slowpath
> | |
> | --0.65%--_raw_spin_lock
> |
> |--1.55%--page_frag_free
> |
> --1.44%--skb_release_data
>
>
> Let me explain what (I think) happens. The mlx5 driver RX-page recycle
> mechanism is not effective in this workload, and pages have to go
> through the page allocator. The lock contention happens during the mlx5
> DMA TX completion cycle. And the page allocator cannot keep up at
> these speeds.
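(If it helps confirm that the recycler is failing here, I can dump the driver RX
page-cache counters - assuming this mlx5 version exposes the rx_cache_* stats:

ethtool -S enp175s0f0 | grep cache
)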
>
> One solution is to extend the page allocator with a bulk free API. (This has
> been on my TODO list for a long time, but I don't have a
> micro-benchmark that tricks the driver page-recycle into failing). It should
> fit nicely, as I can see that kmem_cache_free_bulk() does get
> activated (bulk freeing SKBs), which means that DMA TX completion does
> have a bulk of packets.
>
> We can (and should) also improve the page recycle scheme in the driver.
> After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
> page_pool, and we will (attempt to) generalize this for both high-end
> mlx5 and more low-end ARM64 boards (macchiatobin and espressobin).
>
> The MM people are working in parallel to improve the performance of
> order-0 page returns. Thus, the explicit page bulk free API might
> actually become less important. I actually think (Cc.) Aaron has a
> patchset he would like you to test, which removes the (zone->)lock
> you hit in free_one_page().
>
Ok - Thank You Jesper