Date:   Tue, 15 Aug 2017 12:05:37 +0200
From:   Paweł Staszewski <pstaszewski@...are.pl>
To:     Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Alexander Duyck <alexander.duyck@...il.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Tariq Toukan <tariqt@...lanox.com>
Subject: Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding
 performance vs Core/RSS number / HT on



On 2017-08-15 12:02, Paweł Staszewski wrote:
>
>
> On 2017-08-15 11:57, Jesper Dangaard Brouer wrote:
>> On Tue, 15 Aug 2017 11:30:43 +0200 Paweł Staszewski 
>> <pstaszewski@...are.pl> wrote:
>>
>>> On 2017-08-15 11:23, Jesper Dangaard Brouer wrote:
>>>> On Tue, 15 Aug 2017 02:38:56 +0200
>>>> Paweł Staszewski <pstaszewski@...are.pl> wrote:
>>>>> On 2017-08-14 18:19, Jesper Dangaard Brouer wrote:
>>>>>> On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski 
>>>>>> <pstaszewski@...are.pl> wrote:
>>>>>>> To show the difference, below is a comparison of vlan vs. no-vlan traffic:
>>>>>>>
>>>>>>> 10Mpps forwarded with no vlan vs. 6.9Mpps with vlan
>>>>>> I'm trying to reproduce in my testlab (with ixgbe).  I do see a
>>>>>> performance reduction of about 10-19% when I forward out a VLAN
>>>>>> interface.  This is larger than I expected, but still lower than
>>>>>> the 30-40% slowdown you reported.
>>>>>>
>>>>>> [...]
>>>>> Ok, the Mellanox arrived (MT27700 - mlx5 driver).
>>>>> Comparing Mellanox with and without vlans: 33% performance
>>>>> degradation (less than with ixgbe, where I reach ~40% with the same
>>>>> settings).
>>>>>
>>>>> Mellanox without TX traffic on vlan:
>>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>>> 0;16;64;11089305;709715520;8871553;567779392
>>>>> 1;16;64;11096292;710162688;11095566;710116224
>>>>> 2;16;64;11095770;710129280;11096799;710195136
>>>>> 3;16;64;11097199;710220736;11097702;710252928
>>>>> 4;16;64;11080984;567081856;11079662;709098368
>>>>> 5;16;64;11077696;708972544;11077039;708930496
>>>>> 6;16;64;11082991;709311424;8864802;567347328
>>>>> 7;16;64;11089596;709734144;8870927;709789184
>>>>> 8;16;64;11094043;710018752;11095391;710105024
>>>>>
>>>>> Mellanox with TX traffic on vlan:
>>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>>> 0;16;64;7369914;471674496;7370281;471697980
>>>>> 1;16;64;7368896;471609408;7368043;471554752
>>>>> 2;16;64;7367577;471524864;7367759;471536576
>>>>> 3;16;64;7368744;377305344;7369391;471641024
>>>>> 4;16;64;7366824;471476736;7364330;471237120
>>>>> 5;16;64;7368352;471574528;7367239;471503296
>>>>> 6;16;64;7367459;471517376;7367806;471539584
>>>>> 7;16;64;7367190;471500160;7367988;471551232
>>>>> 8;16;64;7368023;471553472;7368076;471556864
>>>> I wonder if the driver's page recycler is active/working or not, and if
>>>> the situation is different between VLAN vs no-vlan (given
>>>> page_frag_free is so high in your perf top).  The Mellanox driver
>>>> fortunately has stats counters to tell us this explicitly (which the
>>>> ixgbe driver doesn't).
>>>>
>>>> You can use my ethtool_stats.pl script to watch these stats:
>>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>> (Hint perl dependency:  dnf install perl-Time-HiRes)
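
If the script is not at hand, a crude per-second delta of the relevant
counters can also be taken with plain ethtool -S; a minimal sketch
(interface name is just an example):

  ifc=enp175s0f0
  while true; do
          ethtool -S $ifc | egrep 'cache_reuse|page_reuse|rx_packets:' > /tmp/stats.a
          sleep 1
          ethtool -S $ifc | egrep 'cache_reuse|page_reuse|rx_packets:' > /tmp/stats.b
          # print counter name and per-second increase
          paste /tmp/stats.a /tmp/stats.b | awk '{gsub(":","",$1); print $1, $4 - $2}'
  done
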
>>> For RX NIC:
>>> Show adapter(s) (enp175s0f0) statistics (ONLY that changed!)
>>> Ethtool(enp175s0f0) stat:     78380071 (     78,380,071) <= 
>>> rx0_bytes /sec
>>> Ethtool(enp175s0f0) stat:       230978 (        230,978) <= 
>>> rx0_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      1152648 (      1,152,648) <= 
>>> rx0_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1152648 (      1,152,648) <= 
>>> rx0_packets /sec
>>> Ethtool(enp175s0f0) stat:       921614 (        921,614) <= 
>>> rx0_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     78956591 (     78,956,591) <= 
>>> rx1_bytes /sec
>>> Ethtool(enp175s0f0) stat:       233343 (        233,343) <= 
>>> rx1_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      1161126 (      1,161,126) <= 
>>> rx1_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1161126 (      1,161,126) <= 
>>> rx1_packets /sec
>>> Ethtool(enp175s0f0) stat:       927793 (        927,793) <= 
>>> rx1_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     79677124 (     79,677,124) <= 
>>> rx2_bytes /sec
>>> Ethtool(enp175s0f0) stat:       233735 (        233,735) <= 
>>> rx2_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      1171722 (      1,171,722) <= 
>>> rx2_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1171722 (      1,171,722) <= 
>>> rx2_packets /sec
>>> Ethtool(enp175s0f0) stat:       937989 (        937,989) <= 
>>> rx2_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     78392893 (     78,392,893) <= 
>>> rx3_bytes /sec
>>> Ethtool(enp175s0f0) stat:       230311 (        230,311) <= 
>>> rx3_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      1152837 (      1,152,837) <= 
>>> rx3_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1152837 (      1,152,837) <= 
>>> rx3_packets /sec
>>> Ethtool(enp175s0f0) stat:       922513 (        922,513) <= 
>>> rx3_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     65165583 (     65,165,583) <= 
>>> rx4_bytes /sec
>>> Ethtool(enp175s0f0) stat:       191969 (        191,969) <= 
>>> rx4_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:       958317 (        958,317) <= 
>>> rx4_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:       958317 (        958,317) <= 
>>> rx4_packets /sec
>>> Ethtool(enp175s0f0) stat:       766332 (        766,332) <= 
>>> rx4_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     66920721 (     66,920,721) <= 
>>> rx5_bytes /sec
>>> Ethtool(enp175s0f0) stat:       197150 (        197,150) <= 
>>> rx5_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:       984128 (        984,128) <= 
>>> rx5_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:       984128 (        984,128) <= 
>>> rx5_packets /sec
>>> Ethtool(enp175s0f0) stat:       786978 (        786,978) <= 
>>> rx5_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     79076984 (     79,076,984) <= 
>>> rx6_bytes /sec
>>> Ethtool(enp175s0f0) stat:       233735 (        233,735) <= 
>>> rx6_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      1162897 (      1,162,897) <= 
>>> rx6_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1162897 (      1,162,897) <= 
>>> rx6_packets /sec
>>> Ethtool(enp175s0f0) stat:       929163 (        929,163) <= 
>>> rx6_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     78660672 (     78,660,672) <= 
>>> rx7_bytes /sec
>>> Ethtool(enp175s0f0) stat:       230413 (        230,413) <= 
>>> rx7_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      1156775 (      1,156,775) <= 
>>> rx7_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1156775 (      1,156,775) <= 
>>> rx7_packets /sec
>>> Ethtool(enp175s0f0) stat:       926376 (        926,376) <= 
>>> rx7_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:     10674565 (     10,674,565) <= 
>>> rx_65_to_127_bytes_phy /sec
>>> Ethtool(enp175s0f0) stat:    605241031 (    605,241,031) <= rx_bytes 
>>> /sec
>>> Ethtool(enp175s0f0) stat:    768585608 (    768,585,608) <= 
>>> rx_bytes_phy /sec
>>> Ethtool(enp175s0f0) stat:      1781569 (      1,781,569) <= 
>>> rx_cache_reuse /sec
>>> Ethtool(enp175s0f0) stat:      8900603 (      8,900,603) <= 
>>> rx_csum_complete /sec
>>> Ethtool(enp175s0f0) stat:      1773785 (      1,773,785) <= 
>>> rx_out_of_buffer /sec
>>> Ethtool(enp175s0f0) stat:      8900603 (      8,900,603) <= 
>>> rx_packets /sec
>>> Ethtool(enp175s0f0) stat:     10674799 (     10,674,799) <= 
>>> rx_packets_phy /sec
>>> Ethtool(enp175s0f0) stat:      7118993 (      7,118,993) <= 
>>> rx_page_reuse /sec
>>> Ethtool(enp175s0f0) stat:    768565744 (    768,565,744) <= 
>>> rx_prio0_bytes /sec
>>> Ethtool(enp175s0f0) stat:     10674522 (     10,674,522) <= 
>>> rx_prio0_packets /sec
>>> Ethtool(enp175s0f0) stat:    725871089 (    725,871,089) <= 
>>> rx_vport_unicast_bytes /sec
>>> Ethtool(enp175s0f0) stat:     10674575 (     10,674,575) <= 
>>> rx_vport_unicast_packets /sec
>>
>> It looks like the mlx5 page recycle mechanism works:
>>
>>     230413 (        230,413) <= rx7_cache_reuse /sec
>>   + 926376 (        926,376) <= rx7_page_reuse /sec
>>   =1156789 (230413+926376)
>>   -1156775 (      1,156,775) <= rx7_packets /sec
>>   =     14
>>
>> You can also tell from the fact that none of these counters changed:
>>   rx_cache_full
>>   rx_cache_empty
>>   rx1_cache_empty
>>   rx1_cache_busy
>>
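
As a quick aside, that per-ring sanity check can also be run directly on
the cumulative counters; a rough sketch, assuming the mlx5 counter names
shown above (the check above uses per-second rates instead):

  # does cache_reuse + page_reuse roughly add up to packets on ring 7?
  ethtool -S enp175s0f0 | awk '
          /rx7_cache_reuse:/ { c = $2 }
          /rx7_page_reuse:/  { p = $2 }
          /rx7_packets:/     { n = $2 }
          END { printf "reuse=%d packets=%d diff=%d\n", c + p, n, n - (c + p) }'
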
>>> For TX nic with vlan:
>>> Show adapter(s) (enp175s0f1) statistics (ONLY that changed!)
>>> Ethtool(enp175s0f1) stat:            1 (              1) <= 
>>> rx_65_to_127_bytes_phy /sec
>>> Ethtool(enp175s0f1) stat:           71 (             71) <= 
>>> rx_bytes_phy /sec
>>> Ethtool(enp175s0f1) stat:            1 (              1) <= 
>>> rx_multicast_phy /sec
>>> Ethtool(enp175s0f1) stat:            1 (              1) <= 
>>> rx_packets_phy /sec
>>> Ethtool(enp175s0f1) stat:           71 (             71) <= 
>>> rx_prio0_bytes /sec
>>> Ethtool(enp175s0f1) stat:            1 (              1) <= 
>>> rx_prio0_packets /sec
>>> Ethtool(enp175s0f1) stat:           67 (             67) <= 
>>> rx_vport_multicast_bytes /sec
>>> Ethtool(enp175s0f1) stat:            1 (              1) <= 
>>> rx_vport_multicast_packets /sec
>>> Ethtool(enp175s0f1) stat:     64955114 (     64,955,114) <= 
>>> tx0_bytes /sec
>>> Ethtool(enp175s0f1) stat:       955222 (        955,222) <= 
>>> tx0_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        26489 (         26,489) <= tx0_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:       955222 (        955,222) <= 
>>> tx0_packets /sec
>>> Ethtool(enp175s0f1) stat:     66799214 (     66,799,214) <= 
>>> tx1_bytes /sec
>>> Ethtool(enp175s0f1) stat:       982341 (        982,341) <= 
>>> tx1_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        27225 (         27,225) <= tx1_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:       982341 (        982,341) <= 
>>> tx1_packets /sec
>>> Ethtool(enp175s0f1) stat:     78650421 (     78,650,421) <= 
>>> tx2_bytes /sec
>>> Ethtool(enp175s0f1) stat:      1156624 (      1,156,624) <= 
>>> tx2_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        32059 (         32,059) <= tx2_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:      1156624 (      1,156,624) <= 
>>> tx2_packets /sec
>>> Ethtool(enp175s0f1) stat:     78186849 (     78,186,849) <= 
>>> tx3_bytes /sec
>>> Ethtool(enp175s0f1) stat:      1149807 (      1,149,807) <= 
>>> tx3_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        31879 (         31,879) <= tx3_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:      1149807 (      1,149,807) <= 
>>> tx3_packets /sec
>>> Ethtool(enp175s0f1) stat:          234 (            234) <= 
>>> tx3_xmit_more /sec
>>> Ethtool(enp175s0f1) stat:     78466099 (     78,466,099) <= 
>>> tx4_bytes /sec
>>> Ethtool(enp175s0f1) stat:      1153913 (      1,153,913) <= 
>>> tx4_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        31990 (         31,990) <= tx4_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:      1153913 (      1,153,913) <= 
>>> tx4_packets /sec
>>> Ethtool(enp175s0f1) stat:     78765724 (     78,765,724) <= 
>>> tx5_bytes /sec
>>> Ethtool(enp175s0f1) stat:      1158319 (      1,158,319) <= 
>>> tx5_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        32115 (         32,115) <= tx5_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:      1158319 (      1,158,319) <= 
>>> tx5_packets /sec
>>> Ethtool(enp175s0f1) stat:          264 (            264) <= 
>>> tx5_xmit_more /sec
>>> Ethtool(enp175s0f1) stat:     79669524 (     79,669,524) <= 
>>> tx6_bytes /sec
>>> Ethtool(enp175s0f1) stat:      1171611 (      1,171,611) <= 
>>> tx6_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        32490 (         32,490) <= tx6_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:      1171611 (      1,171,611) <= 
>>> tx6_packets /sec
>>> Ethtool(enp175s0f1) stat:     79389329 (     79,389,329) <= 
>>> tx7_bytes /sec
>>> Ethtool(enp175s0f1) stat:      1167490 (      1,167,490) <= 
>>> tx7_csum_none /sec
>>> Ethtool(enp175s0f1) stat:        32365 (         32,365) <= tx7_nop 
>>> /sec
>>> Ethtool(enp175s0f1) stat:      1167490 (      1,167,490) <= 
>>> tx7_packets /sec
>>> Ethtool(enp175s0f1) stat:    604885175 (    604,885,175) <= tx_bytes 
>>> /sec
>>> Ethtool(enp175s0f1) stat:    676059749 (    676,059,749) <= 
>>> tx_bytes_phy /sec
>>> Ethtool(enp175s0f1) stat:      8895370 (      8,895,370) <= 
>>> tx_packets /sec
>>> Ethtool(enp175s0f1) stat:      8895522 (      8,895,522) <= 
>>> tx_packets_phy /sec
>>> Ethtool(enp175s0f1) stat:    676063067 (    676,063,067) <= 
>>> tx_prio0_bytes /sec
>>> Ethtool(enp175s0f1) stat:      8895566 (      8,895,566) <= 
>>> tx_prio0_packets /sec
>>> Ethtool(enp175s0f1) stat:    640470657 (    640,470,657) <= 
>>> tx_vport_unicast_bytes /sec
>>> Ethtool(enp175s0f1) stat:      8895427 (      8,895,427) <= 
>>> tx_vport_unicast_packets /sec
>>> Ethtool(enp175s0f1) stat:          498 (            498) <= 
>>> tx_xmit_more /sec
>> We are seeing some xmit_more, which is interesting.  Have you noticed
>> whether (in the VLAN case) there is a queue in the qdisc layer?
>>
>> Simply inspect with: tc -s qdisc show dev ixgbe2
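
A simple way to keep an eye on a standing queue while the test is running
(device name is just an example) could be:

  # refresh the qdisc stats every second and highlight what changes
  watch -d -n 1 'tc -s qdisc show dev enp175s0f1 | egrep "qdisc|backlog"'
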
> For the vlan there is no qdisc:
> tc -s -d qdisc show dev vlan1000
>
> qdisc noqueue 0: root refcnt 2
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>
> The physical interface has mq attached with pfifo_fast:
>
> tc -s -d qdisc show dev enp175s0f1
> qdisc mq 0: root
>  Sent 1397200697212 bytes 3965888669 pkt (dropped 78065663, overlimits 
> 0 requeues 629868)
>  backlog 0b 0p requeues 629868
> qdisc pfifo_fast 0: parent :38 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :37 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :36 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :35 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :34 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :33 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :32 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :31 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :30 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :2f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :2e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :2d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :2c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :2b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :2a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :29 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :28 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :27 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :26 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :25 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :24 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :23 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :22 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :21 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :20 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :1f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :1e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :1d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :1c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :1b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :1a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :19 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :18 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 176 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :17 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :16 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :15 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :14 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :13 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :12 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :11 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: parent :10 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
> 1 1 1 1 1
>  Sent 18534143578 bytes 289595993 pkt (dropped 0, overlimits 0 
> requeues 1191)
>  backlog 0b 0p requeues 1191
> qdisc pfifo_fast 0: parent :f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 17022213952 bytes 265972093 pkt (dropped 0, overlimits 0 
> requeues 1126)
>  backlog 0b 0p requeues 1126
> qdisc pfifo_fast 0: parent :e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 19610732552 bytes 306417696 pkt (dropped 0, overlimits 0 
> requeues 147499)
>  backlog 0b 0p requeues 147499
> qdisc pfifo_fast 0: parent :d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 18153006206 bytes 283640721 pkt (dropped 0, overlimits 0 
> requeues 171576)
>  backlog 0b 0p requeues 171576
> qdisc pfifo_fast 0: parent :c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 20642461846 bytes 322538466 pkt (dropped 0, overlimits 0 
> requeues 67124)
>  backlog 0b 0p requeues 67124
> qdisc pfifo_fast 0: parent :b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 19343910144 bytes 302248596 pkt (dropped 0, overlimits 0 
> requeues 224627)
>  backlog 0b 0p requeues 224627
> qdisc pfifo_fast 0: parent :a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 20735716968 bytes 323995576 pkt (dropped 0, overlimits 0 
> requeues 1451)
>  backlog 0b 0p requeues 1451
> qdisc pfifo_fast 0: parent :9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 19371526272 bytes 302680098 pkt (dropped 0, overlimits 0 
> requeues 2244)
>  backlog 0b 0p requeues 2244
> qdisc pfifo_fast 0: parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 162202525454 bytes 2444195441 pkt (dropped 31716535, overlimits 
> 0 requeues 2456)
>  backlog 0b 0p requeues 2456
> qdisc pfifo_fast 0: parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 163308408548 bytes 2461561759 pkt (dropped 0, overlimits 0 
> requeues 1454)
>  backlog 0b 0p requeues 1454
> qdisc pfifo_fast 0: parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 164214830346 bytes 2475734074 pkt (dropped 19272, overlimits 0 
> requeues 2216)
>  backlog 0b 0p requeues 2216
> qdisc pfifo_fast 0: parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 160398101830 bytes 2417635038 pkt (dropped 0, overlimits 0 
> requeues 1509)
>  backlog 0b 0p requeues 1509
> qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 161760263056 bytes 2437965677 pkt (dropped 0, overlimits 0 
> requeues 1442)
>  backlog 0b 0p requeues 1442
> qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 160300203040 bytes 2415206290 pkt (dropped 0, overlimits 0 
> requeues 1615)
>  backlog 0b 0p requeues 1615
> qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 140144336184 bytes 2114089739 pkt (dropped 0, overlimits 0 
> requeues 1177)
>  backlog 0b 0p requeues 1177
> qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
> 1 1 1 1
>  Sent 131458317060 bytes 1982280594 pkt (dropped 46329856, overlimits 
> 0 requeues 1161)
>  backlog 0b 0p requeues 1161
>
>
I just noticed that after changing RSS on the NICs I didn't delete and
re-add the qdisc.
Here is the situation after a qdisc del / add:
tc -s -d qdisc show dev enp175s0f1
qdisc mq 1: root
  Sent 43738523966 bytes 683414438 pkt (dropped 0, overlimits 0 requeues 
1886)
  backlog 0b 0p requeues 1886
qdisc pfifo_fast 0: parent 1:10 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2585011904 bytes 40390811 pkt (dropped 0, overlimits 0 requeues 110)
  backlog 0b 0p requeues 110
qdisc pfifo_fast 0: parent 1:f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2602068416 bytes 40657319 pkt (dropped 0, overlimits 0 requeues 121)
  backlog 0b 0p requeues 121
qdisc pfifo_fast 0: parent 1:e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2793524544 bytes 43648821 pkt (dropped 0, overlimits 0 requeues 12)
  backlog 0b 0p requeues 12
qdisc pfifo_fast 0: parent 1:d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2792512768 bytes 43633012 pkt (dropped 0, overlimits 0 requeues 8)
  backlog 0b 0p requeues 8
qdisc pfifo_fast 0: parent 1:c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789822080 bytes 43590970 pkt (dropped 0, overlimits 0 requeues 8)
  backlog 0b 0p requeues 8
qdisc pfifo_fast 0: parent 1:b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789904640 bytes 43592260 pkt (dropped 0, overlimits 0 requeues 9)
  backlog 0b 0p requeues 9
qdisc pfifo_fast 0: parent 1:a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2793279146 bytes 43644987 pkt (dropped 0, overlimits 0 requeues 192)
  backlog 0b 0p requeues 192
qdisc pfifo_fast 0: parent 1:9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789506688 bytes 43586042 pkt (dropped 0, overlimits 0 requeues 171)
  backlog 0b 0p requeues 171
qdisc pfifo_fast 0: parent 1:8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789948928 bytes 43592952 pkt (dropped 0, overlimits 0 requeues 179)
  backlog 0b 0p requeues 179
qdisc pfifo_fast 0: parent 1:7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2790191552 bytes 43596743 pkt (dropped 0, overlimits 0 requeues 177)
  backlog 0b 0p requeues 177
qdisc pfifo_fast 0: parent 1:6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789649280 bytes 43588270 pkt (dropped 0, overlimits 0 requeues 182)
  backlog 0b 0p requeues 182
qdisc pfifo_fast 0: parent 1:5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789654612 bytes 43588354 pkt (dropped 0, overlimits 0 requeues 151)
  backlog 0b 0p requeues 151
qdisc pfifo_fast 0: parent 1:4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2789409280 bytes 43584520 pkt (dropped 0, overlimits 0 requeues 168)
  backlog 0b 0p requeues 168
qdisc pfifo_fast 0: parent 1:3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2792827648 bytes 43637932 pkt (dropped 0, overlimits 0 requeues 165)
  backlog 0b 0p requeues 165
qdisc pfifo_fast 0: parent 1:2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2572823936 bytes 40200374 pkt (dropped 0, overlimits 0 requeues 119)
  backlog 0b 0p requeues 119
qdisc pfifo_fast 0: parent 1:1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 
1 1 1 1
  Sent 2488427264 bytes 38881676 pkt (dropped 0, overlimits 0 requeues 114)
  backlog 0b 0p requeues 114
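
The reset mentioned above was presumably along these lines (a sketch;
deleting the root also clears the stale per-queue counters):

  tc qdisc del dev enp175s0f1 root
  # re-attach the multiqueue root (handle 1: as seen in the dump above)
  tc qdisc add dev enp175s0f1 root handle 1: mq
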


>
>>
>>
>>>>> ethtool settings for both tests:
>>>>> ifc='enp175s0f0 enp175s0f1'
>>>>> for i in $ifc
>>>>>            do
>>>>>            ip link set up dev $i
>>>>>            ethtool -A $i autoneg off rx off tx off
>>>>>            ethtool -G $i rx 128 tx 256
>>>> The ring queue size recommendations might be different for the mlx5
>>>> driver (Cc'ing Mellanox maintainers).
>>>>
>>>>>            ip link set $i txqueuelen 1000
>>>>>            ethtool -C $i rx-usecs 25
>>>>>            ethtool -L $i combined 16
>>>>>            ethtool -K $i gro off tso off gso off sg on 
>>>>> l2-fwd-offload off
>>>>> tx-nocache-copy off ntuple on
>>>>>            ethtool -N $i rx-flow-hash udp4 sdfn
>>>>>            done
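
For completeness, the applied settings can be read back per interface; a
minimal sketch:

  for i in enp175s0f0 enp175s0f1; do
          ethtool -a $i                      # pause frame settings
          ethtool -g $i                      # ring sizes
          ethtool -c $i                      # interrupt coalescing
          ethtool -l $i                      # channel (queue) count
          ethtool -k $i | egrep 'offload|scatter-gather|ntuple'
          ethtool -n $i rx-flow-hash udp4    # RSS hash fields for UDP/IPv4
  done
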
>>>> Thanks for being explicit about what your setup is :-)
>>>>> and perf top:
>>>>>       PerfTop:   83650 irqs/sec  kernel:99.7%  exact: 0.0% [4000Hz
>>>>> cycles],  (all, 56 CPUs)
>>>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
>>>>>
>>>>>
>>>>>        14.25%  [kernel]       [k] dst_release
>>>>>        14.17%  [kernel]       [k] skb_dst_force
>>>>>        13.41%  [kernel]       [k] rt_cache_valid
>>>>>        11.47%  [kernel]       [k] ip_finish_output2
>>>>>         7.01%  [kernel]       [k] do_raw_spin_lock
>>>>>         5.07%  [kernel]       [k] page_frag_free
>>>>>         3.47%  [mlx5_core]    [k] mlx5e_xmit
>>>>>         2.88%  [kernel]       [k] fib_table_lookup
>>>>>         2.43%  [mlx5_core]    [k] skb_from_cqe.isra.32
>>>>>         1.97%  [kernel]       [k] virt_to_head_page
>>>>>         1.81%  [mlx5_core]    [k] mlx5e_poll_tx_cq
>>>>>         0.93%  [kernel]       [k] __dev_queue_xmit
>>>>>         0.87%  [kernel]       [k] __build_skb
>>>>>         0.84%  [kernel]       [k] ipt_do_table
>>>>>         0.79%  [kernel]       [k] ip_rcv
>>>>>         0.79%  [kernel]       [k] acpi_processor_ffh_cstate_enter
>>>>>         0.78%  [kernel]       [k] netif_skb_features
>>>>>         0.73%  [kernel]       [k] __netif_receive_skb_core
>>>>>         0.52%  [kernel]       [k] dev_hard_start_xmit
>>>>>         0.52%  [kernel]       [k] build_skb
>>>>>         0.51%  [kernel]       [k] ip_route_input_rcu
>>>>>         0.50%  [kernel]       [k] skb_unref
>>>>>         0.49%  [kernel]       [k] ip_forward
>>>>>         0.48%  [mlx5_core]    [k] mlx5_cqwq_get_cqe
>>>>>         0.44%  [kernel]       [k] udp_v4_early_demux
>>>>>         0.41%  [kernel]       [k] napi_consume_skb
>>>>>         0.40%  [kernel]       [k] __local_bh_enable_ip
>>>>>         0.39%  [kernel]       [k] ip_rcv_finish
>>>>>         0.39%  [kernel]       [k] kmem_cache_alloc
>>>>>         0.38%  [kernel]       [k] sch_direct_xmit
>>>>>         0.33%  [kernel]       [k] validate_xmit_skb
>>>>>         0.32%  [mlx5_core]    [k] mlx5e_free_rx_wqe_reuse
>>>>>         0.29%  [kernel]       [k] netdev_pick_tx
>>>>>         0.28%  [mlx5_core]    [k] mlx5e_build_rx_skb
>>>>>         0.27%  [kernel]       [k] deliver_ptype_list_skb
>>>>>         0.26%  [kernel]       [k] fib_validate_source
>>>>>         0.26%  [mlx5_core]    [k] mlx5e_napi_poll
>>>>>         0.26%  [mlx5_core]    [k] mlx5e_handle_rx_cqe
>>>>>         0.26%  [mlx5_core]    [k] mlx5e_rx_cache_get
>>>>>         0.25%  [kernel]       [k] eth_header
>>>>>         0.23%  [kernel]       [k] skb_network_protocol
>>>>>         0.20%  [kernel]       [k] nf_hook_slow
>>>>>         0.20%  [kernel]       [k] vlan_passthru_hard_header
>>>>>         0.20%  [kernel]       [k] vlan_dev_hard_start_xmit
>>>>>         0.19%  [kernel]       [k] swiotlb_map_page
>>>>>         0.18%  [kernel]       [k] compound_head
>>>>>         0.18%  [kernel]       [k] neigh_connected_output
>>>>>         0.18%  [mlx5_core]    [k] mlx5e_alloc_rx_wqe
>>>>>         0.18%  [kernel]       [k] ip_output
>>>>>         0.17%  [kernel]       [k] prefetch_freepointer.isra.70
>>>>>         0.17%  [kernel]       [k] __slab_free
>>>>>         0.16%  [kernel]       [k] eth_type_vlan
>>>>>         0.16%  [kernel]       [k] ip_finish_output
>>>>>         0.15%  [kernel]       [k] kmem_cache_free_bulk
>>>>>         0.14%  [kernel]       [k] netif_receive_skb_internal
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> wondering why this:
>>>>>         1.97%  [kernel]       [k] virt_to_head_page
>>>>> is in the top...
>>>> This is related to the page_frag_free() call, but it is weird that it
>>>> shows up because it is supposed to be inlined (it is explicitly marked
>>>> inline in include/linux/mm.h).
>>>>
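
A quick way to check whether an out-of-line copy was actually emitted in
the running kernel:

  # if the symbol appears here, the compiler emitted an out-of-line copy,
  # which is the only way perf can attribute samples to it by that name
  grep -w virt_to_head_page /proc/kallsyms
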
>>>>>>>>>> perf top:
>>>>>>>>>>
>>>>>>>>>>       PerfTop:   77835 irqs/sec  kernel:99.7%
>>>>>>>>>> ---------------------------------------------
>>>>>>>>>>
>>>>>>>>>>          16.32%  [kernel]       [k] skb_dst_force
>>>>>>>>>>          16.30%  [kernel]       [k] dst_release
>>>>>>>>>>          15.11%  [kernel]       [k] rt_cache_valid
>>>>>>>>>>          12.62%  [kernel]       [k] ipv4_mtu
>>>>>>>>> It seems a little strange that these 4 functions are at the top
>>>>>> I don't see these in my test.
>>>>>>>>>>           5.60% [kernel]       [k] do_raw_spin_lock
>>>>>>>>> What is calling/taking this lock? (Use perf call-graph recording.)
>>>>>>>> it can be hard to paste here :)
>>>>>>>> attached as a file
>>>>>> The attachment was very big. Please don't attach such big files on
>>>>>> mailing lists.  Next time please share them via e.g. pastebin. The
>>>>>> output was a capture from your terminal, which made it more difficult
>>>>>> to read.  Hint: you can/could use perf report --stdio and place it in
>>>>>> a file instead.
>>>>>>
>>>>>> The output (extracted below) didn't show who called
>>>>>> 'do_raw_spin_lock', BUT it showed another interesting thing.  The
>>>>>> kernel code in __dev_queue_xmit() might create a route dst-cache
>>>>>> problem for itself(?), as it will first call skb_dst_force() and then
>>>>>> skb_dst_drop() when the packet is transmitted on a VLAN.
>>>>>>
>>>>>>     static int __dev_queue_xmit(struct sk_buff *skb, void 
>>>>>> *accel_priv)
>>>>>>     {
>>>>>>     [...]
>>>>>>     /* If device/qdisc don't need skb->dst, release it right now 
>>>>>> while
>>>>>>      * its hot in this cpu cache.
>>>>>>      */
>>>>>>     if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
>>>>>>         skb_dst_drop(skb);
>>>>>>     else
>>>>>>         skb_dst_force(skb);
>>>>>>
>>
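
For the call-graph question above, a minimal recording recipe (file names
are just examples) would be something like:

  # record call graphs system-wide for ~10 seconds while traffic is flowing
  perf record -a -g -- sleep 10
  # text report with callers expanded, suitable for pasting/sharing
  perf report -g --stdio > perf-report.txt
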
>
>
