Date:   Mon, 14 Aug 2017 17:07:16 +0200
From:   Paweł Staszewski <pstaszewski@...are.pl>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Jesper Dangaard Brouer <brouer@...hat.com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding
 performance vs Core/RSS number / HT on



On 2017-08-14 at 02:07, Alexander Duyck wrote:
> On Sat, Aug 12, 2017 at 10:27 AM, Paweł Staszewski
> <pstaszewski@...are.pl> wrote:
>> Hi and thanks for reply
>>
>>
>>
>> On 2017-08-12 at 14:23, Jesper Dangaard Brouer wrote:
>>> On Fri, 11 Aug 2017 19:51:10 +0200 Paweł Staszewski
>>> <pstaszewski@...are.pl> wrote:
>>>
>>>> Hi
>>>>
>>>> I made some tests for performance comparison.
>>> Thanks for doing this. Feel free to Cc me, if you do more of these
>>> tests (so I don't miss them on the mailing list).
>>>
>>> I don't understand if you are reporting a potential problem?
>>>
>>> It would be good if you can provide a short summary section (of the
>>> issue) in the _start_ of the email, and then provide all this nice data
>>> afterwards, to back your case.
>>>
>>> My understanding is, you report:
>>>
>>> 1. VLANs on ixgbe show a 30-40% slowdown
>>> 2. System stopped scaling after 7+ CPUs
> So I had read through most of this before I realized what it was you
> were reporting. As far as the behavior goes, there are a few things going
> on. I have some additional comments below, but they are mostly based on
> what I had read up to that point.
>
> As far as possible issues for item 1: the VLAN adds 4 bytes of data to
> the payload; when it is stripped it can result in a packet that is 56
> bytes. The missing 8 bytes (relative to a 64B cache line) can cause
> issues, as it forces the CPU to do a read/modify/write every time the
> device writes to the 64B cache line instead of just doing a single
> write. This can be very expensive and hurt performance. In addition it
> adds 4 bytes on the wire, so if you are sending the same 64B packets
> over the VLAN interface it is bumping them up to 68B to make room for
> the VLAN tag. I suspect you are encountering one of these types of
> issues. You might try tweaking the packet sizes in increments of 4 to
> see if there is a sweet spot that you might be falling out of or into.
No, this is not a problem with the 4-byte header or so.

Because the topology is like this:

TX generator (pktgen), physical interface, no vlan -> RX physical
interface (no vlan) [ FORWARDING HOST ] TX vlan interface bound to
physical interface -> SINK

Below is data for packet size 70 (pktgen PKT_SIZE: 70):

ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX

0;16;70;7246720;434749440;7245917;420269856
1;16;70;7249152;434872320;7248885;420434344
2;16;70;7249024;434926080;7249225;420401400
3;16;70;7249984;434952960;7249448;420435736
4;16;70;7251200;435064320;7250990;420495244
5;16;70;7241408;434592000;7241781;420068074
6;16;70;7229696;433689600;7229750;419268196
7;16;70;7236032;434127360;7236133;419669092
8;16;70;7236608;434161920;7236274;419695830
9;16;70;7226496;433578240;7227501;419107826

100% cpu load on all 16 cores

The vlan/no-vlan difference on this host currently varies from 40 to
even 50% (but I can't check whether it really reaches a 50% performance
degradation, because pktgen can only give me 10Mpps, and at that rate the
forwarding host is at 70% CPU load - so there is still headroom to
forward, maybe up to line rate at ~14Mpps; scaled to 100% CPU that would
be roughly 14.3Mpps without the vlan vs the 7.25Mpps seen with it, i.e.
close to 50% lower).
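
For comparison, the wire overhead of the VLAN tag itself is tiny - a rough
back-of-the-envelope sketch (assuming the standard 20 bytes of per-frame
overhead for preamble + SFD + inter-frame gap on 10GbE):

#!/bin/sh
# theoretical 10GbE packet rate for a given frame size
for size in 64 68; do                         # 68B = 64B frame + 4B VLAN tag
        pps=$((10000000000 / ((size + 20) * 8)))
        echo "frame ${size}B -> ~${pps} pps"  # ~14.88Mpps vs ~14.20Mpps
done

So the extra 4 bytes on the wire only account for roughly 4-5% of line
rate - nothing close to the 40-50% seen here.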

> Item 2 is a known issue with the NICs supported by ixgbe, at least for
> anything 82599 and later. The issue here is that there isn't really an
> Rx descriptor cache, so to try and optimize performance the hardware
> will try to write back as many descriptors as it has ready for the ring
> requesting writeback. The problem is that as you add more rings, the
> writes get smaller because they trigger more often. So what you end up
> seeing is that for each additional ring you add, the performance starts
> dropping as soon as the rings are no longer being fully saturated. You
> can tell this has happened when the CPUs in use suddenly all stop
> reporting 100% softirq use. So for example, to perform at line rate
> with 64B packets you would need something like XDP and to keep the ring
> count small, like maybe 2 rings. Any more than that and the performance
> will start to drop as you hit PCIe bottlenecks.
>
>> This is not only a problem/bug report - but some kind of comparison plus
>> some thoughts about possible problems :)
>> And it can help somebody searching the net for what to expect :)
>> Also - I don't know a better list, where the smartest people who know
>> what is going on in kernel networking are :)
>>
>> Next time I will place the summary on top - sorry :)
>>
>>>> Tested HW (FORWARDING HOST):
>>>>
>>>> Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>> Interesting, I've not heard about an Intel CPU called "Gold" before now,
>>> but it does exist:
>>>
>>> https://ark.intel.com/products/123541/Intel-Xeon-Gold-6132-Processor-19_25M-Cache-2_60-GHz
>>>
>>>
>>>> Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
>>> This is one of my all time favorite NICs!
>> Yes, this is a good NIC - I will have a ConnectX-4 2x100G by Monday, so I
>> will also do some tests
>>
>>>> Test diagram:
>>>>
>>>>
>>>> TRAFFIC GENERATOR (ethX) -> (enp216s0f0 - RX Traffic) FORWARDING HOST
>>>> (enp216s0f1(vlan1000) - TX Traffic) -> (ethY) SINK
>>>>
>>>> Forwarder traffic: UDP random ports from 9 to 19 with random hosts from
>>>> 172.16.0.1 to 172.16.0.255
>>>>
>>>> TRAFFIC GENERATOR TX is stable 9.9Mpps (in kernel pktgen)
>>> What kind of traffic flow?  E.g. distribution, many/few source IPs...
>>
>> The traffic generator is pktgen, so UDP flows - better to paste the
>> parameters from pktgen:
>>      UDP_MIN=9
>>      UDP_MAX=19
>>
>>      pg_set $dev "dst_min 172.16.0.1"
>>      pg_set $dev "dst_max 172.16.0.100"
>>
>>      # Setup random UDP port src range
>>      #pg_set $dev "flag UDPSRC_RND"
>>      pg_set $dev "flag UDPSRC_RND"
>>      pg_set $dev "udp_src_min $UDP_MIN"
>>      pg_set $dev "udp_src_max $UDP_MAX"
>>
>>
>>>
>>>> Settings used for FORWARDING HOST (changed param. was only number of RSS
>>>> combined queues + set affinity assignment for them to fit with first
>>>> numa node where 2x10G port card is installed)
>>>>
>>>> ixgbe driver used from kernel (in-kernel build - not a module)
>>>>
>>> Nice with a script showing your setup, thanks. It would be good if it had
>>> comments telling why you think each setup adjustment is needed.
>>>
>>>> #!/bin/sh
>>>> ifc='enp216s0f0 enp216s0f1'
>>>> for i in $ifc
>>>>            do
>>>>            ip link set up dev $i
>>>>            ethtool -A $i autoneg off rx off tx off
>>> Good:
>>>    Turning off Ethernet flow control, to avoid receiver being the
>>>    bottleneck via pause-frames.
>> Yes - enabled flow control is really bad :)
>>>>            ethtool -G $i rx 4096 tx 1024
>>> You adjust the RX and TX ring queue sizes; this has effects that you
>>> might not realize.  Especially for the ixgbe driver, which has a page
>>> recycle trick tied to the RX ring queue size.
>> rx ring 4096 and tx ring 1024
>> - this is because I then get the best performance with average packet sizes
>> from 64 to 1500 bytes
> The problem is this has huge negative effects on the CPU caches.
> Generally less is more. When I perform tests I will usually drop the
> ring size for Tx to 128 and Rx to 256. That reduces the descriptor
> caches per ring to 1 page each for the Tx and Rx. With an increased
> interrupt rate you should be able to service this optimally without
> too much issue.
>
> Also, for these types of tests the Tx ring never really gets over 64
> packets anyway, since a single Tx ring is always populated by a single
> Rx ring. So as long as there isn't any flow control in play, the Tx
> queue should always be empty when the Rx clean-up begins, and it will
> only be populated with up to a NAPI poll weight's worth of packets.
I will also check different ring sizes to compare this.
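
For reference, the ring sizes suggested above could be applied with
something like:

        ethtool -G enp216s0f0 rx 256 tx 128
        ethtool -G enp216s0f1 rx 256 tx 128

and then re-checked against the 4096/1024 results.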


>
>> Performance can be a little better for smaller frames like 64 with the rx
>> ring set to 1024.
>> Below: 1 core / 1 RSS queue with the rx ring set to 1024
>>
>> 0;1;64;1530112;91772160;1529919;88724208
>> 1;1;64;1531584;91872000;1531520;88813196
>> 2;1;64;1531392;91895040;1531262;88831930
>> 3;1;64;1530880;91875840;1531201;88783558
>> 4;1;64;1530688;91829760;1530688;88768826
>> 5;1;64;1530432;91810560;1530624;88764940
>> 6;1;64;1530880;91868160;1530878;88787328
>> 7;1;64;1530496;91845120;1530560;88765114
>> 8;1;64;1530496;91837440;1530687;88772538
>> 9;1;64;1530176;91795200;1530496;88735360
>>
>> so from 1.47Mpps to 1.53Mpps
>>
>> But with bigger packets (> 200) performance is better when the rx ring is set to 4096
> This is likely due to the interrupt moderation on the adapter. Instead
> of adjusting the ring size up you might try pushing the time between
> interrupts down. I have generally found around 25 usecs is best. You
> can change the rx-usecs value via ethtool -C to get the rate you want.
> You should find that it will perform better that way since you put
> less stress on the CPU caches.
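
For reference, that moderation setting could be tried with something like:

        ethtool -C enp216s0f0 rx-usecs 25
        ethtool -C enp216s0f1 rx-usecs 25

instead of the rx-usecs 10 used in the setup script.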
>
>>>>            ip link set $i txqueuelen 1000
>>> Setting tx queue len to the default 1000 seems redundant.
>> Yes, because I'm also changing this parameter to see whether it has any
>> impact on performance
>>>
>>>>            ethtool -C $i rx-usecs 10
>>> Adjusting this also has effects you might not realize.  This actually
>>> also affects the page recycle scheme of ixgbe.  And it can sometimes be
>>> used to solve stalling on DMA TX completions, which could be your issue
>>> here.
>> Same here - setting rx-usecs to 10 was a kind of compromise to get good
>> performance with both big and small packet sizes
> From my personal experience I can say that 10 is probably too
> aggressive. The logic for trying to find an ideal interrupt rate for
> these kinds of tests is actually pretty simple. What you want to do is
> have the updates coming fast enough that you never hit the point of
> descriptor starvation, but at the same time you don't want them coming
> too quickly otherwise you limit how many descriptors can be coalesced
> into a single PCI DMA write since the descriptors have to be flushed
> when an interrupt is triggered.
>
>> Same test as above with rx ring 1024, tx ring 1024 and rx-usecs set to 256
>> (1 core / 1 RSS queue):
>> 0;1;64;1506304;90424320;1506626;87402868
>> 1;1;64;1505536;90343680;1504830;87321088
>> 2;1;64;1506880;90416640;1507522;87388120
>> 3;1;64;1511040;90700800;1511682;87684864
>> 4;1;64;1511040;90681600;1511102;87662476
>> 5;1;64;1511488;90712320;1511614;87673728
>> 6;1;64;1511296;90700800;1511038;87669900
>> 7;1;64;1513344;90773760;1513280;87751680
>> 8;1;64;1513536;90850560;1513470;87807360
>> 9;1;64;1512128;90696960;1512000;87696000
>>
>> And rx-usecs set to 1
>> 0;1;64;1533632;92037120;1533504;88954368
>> 1;1;64;1533632;92006400;1533570;88943348
>> 2;1;64;1533504;91994880;1533504;88931980
>> 3;1;64;1532864;91979520;1532674;88902516
>> 4;1;64;1533952;92044800;1534080;88961792
>> 5;1;64;1533888;92048640;1534270;88969100
>> 6;1;64;1533952;92037120;1534082;88969216
>> 7;1;64;1533952;92021760;1534208;88969332
>> 8;1;64;1533056;91983360;1532930;88883724
>> 9;1;64;1533760;92021760;1533886;88946828
>>
>> rx-usecs set to 2
>> 0;1;64;1522432;91334400;1522304;88301056
>> 1;1;64;1521920;91330560;1522496;88286208
>> 2;1;64;1522496;91322880;1522432;88304768
>> 3;1;64;1523456;91422720;1523649;88382762
>> 4;1;64;1527680;91676160;1527424;88601728
>> 5;1;64;1527104;91626240;1526912;88572032
>> 6;1;64;1527424;91641600;1527424;88590592
>> 7;1;64;1526336;91572480;1526912;88523776
>> 8;1;64;1527040;91637760;1526912;88579456
>> 9;1;64;1527040;91595520;1526784;88553472
>>
>> rx-usecs set to 3
>> 0;1;64;1526272;91549440;1526592;88527488
>> 1;1;64;1526528;91560960;1526272;88516352
>> 2;1;64;1525952;91580160;1525888;88527488
>> 3;1;64;1525504;91511040;1524864;88456960
>> 4;1;64;1526272;91568640;1526208;88494080
>> 5;1;64;1525568;91545600;1525312;88494080
>> 6;1;64;1526144;91584000;1526080;88512640
>> 7;1;64;1525376;91530240;1525376;88482944
>> 8;1;64;1526784;91607040;1526592;88549760
>> 9;1;64;1526208;91560960;1526528;88512640
>>
>>
>>>>            ethtool -L $i combined 16
>>>>            ethtool -K $i gro on tso on gso off sg on l2-fwd-offload off
>>>> tx-nocache-copy on ntuple on
>>> Here are many setting above.
>> Yes, mostly NIC defaults besides ntuple, which is on (for testing some nfc
>> drop filters - and also trying to test tc-offload)
>>
>>> GRO/GSO/TSO for _forwarding_ is actually bad... in my tests, enabling
>>> this results in an approx 10% slowdown.
>> Ok, let's give it a try :)
>> gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy on ntuple
>> on
>> rx-usecs 10
>> 1 CPU / 1 RSS QUEUE
>>
>> 0;1;64;1609344;96537600;1609279;93327104
>> 1;1;64;1608320;96514560;1608256;93293812
>> 2;1;64;1608000;96487680;1608125;93267770
>> 3;1;64;1608320;96522240;1608576;93297524
>> 4;1;64;1605888;96387840;1606211;93148986
>> 5;1;64;1601472;96072960;1601600;92870644
>> 6;1;64;1602624;96180480;1602243;92959674
>> 7;1;64;1601728;96107520;1602113;92907764
>> 8;1;64;1602176;96122880;1602176;92933806
>> 9;1;64;1603904;96253440;1603777;93045208
>>
>> A little better performance: 1.6Mpps.
>> But I'm wondering whether disabling tso will have a performance impact for
>> tcp traffic ...
> If you were passing TCP traffic through the router GRO/TSO would
> impact things, but for UDP it just adds overhead.
>
>> I will try to get something pktgen-like, e.g. pktgen-dpdk, that can also
>> generate tcp traffic - to compare this.
>>
>>
>>> AFAIK "tx-nocache-copy on" was also determined to be a bad option.
>> I set this to on because I get better performance (a little, ~10kpps, for
>> this test).
>> Below, the same test as above with tx-nocache-copy off:
>>
>> 0;1;64;1591552;95496960;1591230;92313654
>> 1;1;64;1596224;95738880;1595842;92555066
>> 2;1;64;1595456;95700480;1595201;92521774
>> 3;1;64;1595456;95723520;1595072;92528966
>> 4;1;64;1595136;95692800;1595457;92503040
>> 5;1;64;1594624;95631360;1594496;92473402
>> 6;1;64;1596224;95761920;1595778;92551180
>> 7;1;64;1595200;95700480;1595331;92521542
>> 8;1;64;1595584;95692800;1595457;92521426
>> 9;1;64;1594624;95662080;1594048;92469574
> If I recall it should have no actual impact one way or the other. The
> tx-nocache-copy option should only impact socket traffic, not routing
> since if I recall correctly it only impacts copies from userspace.
>
>>> The "ntuple on" AFAIK disables the flow-director in the NIC.  I though
>>> this would actually help VLAN traffic, but I guess not.
>> Yes, I enabled this because I was thinking it could help with traffic on vlans
>>
>> Below, the same test with ntuple off,
>> so all settings for ixgbe:
>> gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy off ntuple
>> off
>> rx-usecs 10
>> rx-flow-hash udp4 sdfn
>>
>> 0;1;64;1611840;96691200;1611905;93460794
>> 1;1;64;1610688;96645120;1610818;93427328
>> 2;1;64;1610752;96668160;1610497;93442176
>> 3;1;64;1610624;96664320;1610817;93427212
>> 4;1;64;1610752;96652800;1610623;93412480
>> 5;1;64;1610048;96614400;1610112;93404940
>> 6;1;64;1611264;96641280;1611390;93427212
>> 7;1;64;1611008;96691200;1610942;93468160
>> 8;1;64;1610048;96652800;1609984;93408652
>> 9;1;64;1611136;96641280;1610690;93434636
>>
>> Performance is a little better.
>> And now with tx-nocache-copy on:
>>
>> 0;1;64;1597248;95834880;1597311;92644096
>> 1;1;64;1597888;95865600;1597824;92677446
>> 2;1;64;1597952;95834880;1597822;92644038
>> 3;1;64;1597568;95877120;1597375;92685044
>> 4;1;64;1597184;95827200;1597314;92629190
>> 5;1;64;1597696;95842560;1597565;92625652
>> 6;1;64;1597312;95834880;1597376;92644038
>> 7;1;64;1597568;95873280;1597634;92647924
>> 8;1;64;1598400;95919360;1598849;92699602
>> 9;1;64;1597824;95873280;1598208;92684928
>>
>>
>> That is weird - so enabling tx-nocache-copy with ntuple disabled has a bad
>> performance impact - but with ntuple enabled there is no performance impact
> I would leave the ntuple feature enabled if you are routing simply
> because that disables the ixgbe feature ATR which can have a negative
> impact on routing tests (causes reordering).
>
>>>
>>>>            ethtool -N $i rx-flow-hash udp4 sdfn
>>> Why do you change the NIC's flow-hash?
>> When using 16 cores / 16 RSS queues, there was better load distribution over
>> all cores with the sdfn rx-flow-hash enabled
> That is to be expected. The default hash will only hash on IPv4
> addresses. Enabling the use of UDP ports would allow for more entropy.
> If you want similar performance without resorting to hashing on ports
> you would have to change the source/destination IP addresses.
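
If one wanted to keep the default IPv4-only hash, the extra entropy could
instead come from the generator side - a sketch using the same pktgen knobs
as in the generator script above:

        pg_set $dev "dst_min 172.16.0.1"
        pg_set $dev "dst_max 172.16.15.255"
        # or pick destination addresses randomly within that range:
        pg_set $dev "flag IPDST_RND"

The wider destination range still stays inside the routed 172.16.0.0/12.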
>
>>>>            done
>>>>
>>>> ip link set up dev enp216s0f0
>>>> ip link set up dev enp216s0f1
>>>>
>>>> ip a a 10.0.0.1/30 dev enp216s0f0
>>>>
>>>> ip link add link enp216s0f1 name vlan1000 type vlan id 1000
>>>> ip link set up dev vlan1000
>>>> ip a a 10.0.0.5/30 dev vlan1000
>>>>
>>>>
>>>> ip route add 172.16.0.0/12 via 10.0.0.6
>>>>
>>>> ./set_irq_affinity.sh -x 14-27,42-43 enp216s0f0
>>>> ./set_irq_affinity.sh -x 14-27,42-43 enp216s0f1
>>>> #cat  /sys/devices/system/node/node1/cpulist
>>>> #14-27,42-55
>>>> #cat  /sys/devices/system/node/node0/cpulist
>>>> #0-13,28-41
>>> Is this a NUMA system?
>> This is 2x CPU 6132 - so there are two separate PCIe paths to the NIC - I
>> need to check which CPU the PCIe slot with the network card is attached to,
>> so that the network card is local to the CPU where all the IRQs are bound
>>
>>>
>>>> #################################################
>>>>
>>>>
>>>> Looks like forwarding performance when using vlans on ixgbe is about
>>>> 30-40% less than without vlans (wondering if this is some vlan
>>>> offloading problem with ixgbe)
>>> I would see this as a problem/bug that enabling VLANs cost this much.
>> Yes - I was thinking that with tx/rx vlan offloading there would not be much
>> performance impact when vlans are used.
> What is the rate difference? Also, did you account for the header size
> when noticing that there is a difference in rates? I just want to make
> sure we aren't seeing an issue where you are expecting a rate of
> 14.88Mpps when VLAN tags drop the rate, due to header overhead, down to
> something like 14.2Mpps, if I recall correctly.

As in the reply above - the difference is:
With vlan: 7Mpps (100% CPU on 16 cores with 16 RSS queues)
Without vlan: 10Mpps (70% CPU load on 16 cores with 16 RSS queues)

So this is a really big difference.

>
>>>> settings below:
>>>>
>>>> ethtool -k enp216s0f0
>>>> Features for enp216s0f0:
>>>> Cannot get device udp-fragmentation-offload settings: Operation not
>>>> supported
>>>> rx-checksumming: on
>>>> tx-checksumming: on
>>>>            tx-checksum-ipv4: off [fixed]
>>>>            tx-checksum-ip-generic: on
>>>>            tx-checksum-ipv6: off [fixed]
>>>>            tx-checksum-fcoe-crc: off [fixed]
>>>>            tx-checksum-sctp: on
>>>> scatter-gather: on
>>>>            tx-scatter-gather: on
>>>>            tx-scatter-gather-fraglist: off [fixed]
>>>> tcp-segmentation-offload: on
>>>>            tx-tcp-segmentation: on
>>>>            tx-tcp-ecn-segmentation: off [fixed]
>>>>            tx-tcp-mangleid-segmentation: on
>>>>            tx-tcp6-segmentation: on
>>>> udp-fragmentation-offload: off
>>>> generic-segmentation-offload: off
>>>> generic-receive-offload: on
>>>> large-receive-offload: off
>>>> rx-vlan-offload: on
>>>> tx-vlan-offload: on
>>>> ntuple-filters: on
>>>> receive-hashing: on
>>>> highdma: on [fixed]
>>>> rx-vlan-filter: on
>>>> vlan-challenged: off [fixed]
>>>> tx-lockless: off [fixed]
>>>> netns-local: off [fixed]
>>>> tx-gso-robust: off [fixed]
>>>> tx-fcoe-segmentation: off [fixed]
>>>> tx-gre-segmentation: on
>>>> tx-gre-csum-segmentation: on
>>>> tx-ipxip4-segmentation: on
>>>> tx-ipxip6-segmentation: on
>>>> tx-udp_tnl-segmentation: on
>>>> tx-udp_tnl-csum-segmentation: on
>>>> tx-gso-partial: on
>>>> tx-sctp-segmentation: off [fixed]
>>>> tx-esp-segmentation: off [fixed]
>>>> fcoe-mtu: off [fixed]
>>>> tx-nocache-copy: on
>>>> loopback: off [fixed]
>>>> rx-fcs: off [fixed]
>>>> rx-all: off
>>>> tx-vlan-stag-hw-insert: off [fixed]
>>>> rx-vlan-stag-hw-parse: off [fixed]
>>>> rx-vlan-stag-filter: off [fixed]
>>>> l2-fwd-offload: off
>>>> hw-tc-offload: off
>>>> esp-hw-offload: off [fixed]
>>>> esp-tx-csum-hw-offload: off [fixed]
>>>> rx-udp_tunnel-port-offload: on
>>>>
>>>>
>>>> Another thing is that forwarding performance does not scale with number
>>>> of cores when 7+ cores are reached
>>> I've seen problems with using Hyper-Threading CPUs.  Could it be that
>>> above 7 CPUs you are starting to use sibling-cores ?
>>>
> I would suspect that is more than likely the case. One thing
> you might look at doing is CPU pinning the interrupts for the NIC in a
> 1:1 fashion so that the queues are all bound to separate cores without
> them sharing between Hyper-threads.
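
A sketch of such a 1:1 pinning (using only the physical cores of the
NIC-local NUMA node - per the node1 cpulist above that is 14-27, leaving
the HT siblings 42-55 unused) could look like:

        ethtool -L enp216s0f0 combined 14
        ethtool -L enp216s0f1 combined 14
        ./set_irq_affinity.sh -x 14-27 enp216s0f0
        ./set_irq_affinity.sh -x 14-27 enp216s0f1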
>
>> turbostat can help here:
>> Package Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ SMI     C1
>> C2      C1%     C2%     CPU%c1  CPU%c6  CoreTmp PkgTmp  PkgWatt RAMWatt
>> PKG_%   RAM_%
>> -       -       -       72      2.27    3188    2600    194844 0       64
>> 69282   0.07    97.83   18.38   79.36   -4 54      123.49  16.08   0.00
>> 0.00
>> 0       0       0       8       0.74    1028    2600    1513 0       32
>> 1462    1.50    97.99   10.92   88.34   47 51      58.34   5.34    0.00
>> 0.00
>> 0       0       28      7       0.67    1015    2600    1255 0       12
>> 1249    0.96    98.61   10.99
>> 0       1       1       7       0.68    1019    2600    1260 0       0
>> 1260    0.00    99.54   8.44    90.88   49
>> 0       1       29      9       0.71    1208    2600    1252 0       0
>> 1253    0.00    99.48   8.41
>> 0       2       2       7       0.67    1019    2600    1261 0       0
>> 1260    0.00    99.54   8.44    90.89   48
>> 0       2       30      7       0.67    1017    2600    1255 0       0
>> 1255    0.00    99.55   8.44
>> 0       3       3       7       0.68    1019    2600    1260 0       0
>> 1259    0.00    99.53   8.46    90.86   -4
>> 0       3       31      7       0.67    1017    2600    1256 0       0
>> 1256    0.00    99.55   8.46
>> 0       4       4       7       0.67    1027    2600    1260 0       0
>> 1260    0.00    99.54   8.43    90.90   -4
>> 0       4       32      7       0.66    1018    2600    1255 0       0
>> 1255    0.00    99.55   8.44
>> 0       5       5       7       0.68    1020    2600    1260 0       0
>> 1257    0.00    99.54   8.44    90.89   50
>> 0       5       33      7       0.68    1019    2600    1255 0       0
>> 1255    0.00    99.55   8.43
>> 0       6       6       7       0.70    1019    2600    1260 0       0
>> 1259    0.00    99.53   8.43    90.87   -4
>> 0       6       34      7       0.70    1019    2600    1255 0       0
>> 1255    0.00    99.54   8.43
>> 0       8       7       7       0.68    1019    2600    1262 0       0
>> 1261    0.00    99.52   8.42    90.90   50
>> 0       8       35      7       0.67    1019    2600    1255 0       0
>> 1255    0.00    99.55   8.43
>> 0       9       8       7       0.68    1019    2600    1260 0       0
>> 1257    0.00    99.54   8.40    90.92   49
>> 0       9       36      7       0.66    1017    2600    1255 0       0
>> 1255    0.00    99.55   8.41
>> 0       10      9       7       0.66    1018    2600    1257 0       0
>> 1257    0.00    99.54   8.40    90.94   -4
>> 0       10      37      7       0.66    1018    2600    1255 0       0
>> 1255    0.00    99.55   8.41
>> 0       11      10      7       0.66    1019    2600    1257 0       0
>> 1259    0.00    99.54   8.56    90.77   -4
>> 0       11      38      7       0.66    1018    2600    1255 0       3
>> 1252    0.19    99.36   8.57
>> 0       12      11      7       0.67    1019    2600    1260 0       0
>> 1260    0.00    99.54   8.44    90.88   -4
>> 0       12      39      7       0.67    1019    2600    1255 0       0
>> 1256    0.00    99.55   8.44
>> 0       13      12      7       0.68    1019    2600    1257 0       4
>> 1254    0.32    99.22   8.67    90.65   -4
>> 0       13      40      7       0.69    1019    2600    1256 0       4
>> 1253    0.24    99.31   8.66
>> 0       14      13      7       0.71    1020    2600    1260 0       0
>> 1259    0.00    99.53   8.41    90.88   -4
>> 0       14      41      7       0.72    1020    2600    1255 0       0
>> 1255    0.00    99.54   8.40
>> 1       0       14      3564    99.19   3594    2600    125472 0       0
>> 0       0.00    0.00    0.81    0.00    54 54      65.15   10.74   0.00
>> 0.00
>> 1       0       42      3       0.07    3701    2600    1255 0       0
>> 1255    0.00    99.95   99.93
>> 1       1       15      11      0.32    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.37   73.31   42
>> 1       1       43      10      0.31    3301    2600    1255 0       0
>> 1255    0.00    99.82   26.38
>> 1       2       16      10      0.31    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.37   73.32   39
>> 1       2       44      10      0.32    3301    2600    1255 0       0
>> 1255    0.00    99.82   26.36
>> 1       3       17      10      0.32    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.40   73.28   39
>> 1       3       45      11      0.32    3301    2600    1255 0       0
>> 1255    0.00    99.81   26.40
>> 1       4       18      10      0.32    3301    2600    1257 0       0
>> 1257    0.00    99.82   26.40   73.28   40
>> 1       4       46      11      0.32    3301    2600    1255 0       0
>> 1255    0.00    99.82   26.40
>> 1       5       19      11      0.33    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.40   73.27   39
>> 1       5       47      11      0.33    3300    2600    1255 0       0
>> 1255    0.00    99.82   26.40
>> 1       6       20      12      0.35    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.38   73.27   42
>> 1       6       48      12      0.36    3301    2600    1255 0       0
>> 1255    0.00    99.81   26.37
>> 1       8       21      11      0.33    3301    2600    1257 0       0
>> 1257    0.00    99.82   26.37   73.29   42
>> 1       8       49      11      0.33    3301    2600    1255 0       0
>> 1255    0.00    99.82   26.38
>> 1       9       22      10      0.32    3300    2600    1257 0       0
>> 1257    0.00    99.82   26.35   73.34   41
>> 1       9       50      10      0.30    3301    2600    1255 0       0
>> 1255    0.00    99.82   26.36
>> 1       10      23      10      0.31    3301    2600    1257 0       0
>> 1257    0.00    99.82   26.37   73.33   41
>> 1       10      51      10      0.31    3301    2600    1255 0       0
>> 1255    0.00    99.82   26.36
>> 1       11      24      10      0.32    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.62   73.06   41
>> 1       11      52      10      0.32    3301    2600    1255 0       4
>> 1251    0.32    99.50   26.62
>> 1       12      25      11      0.33    3301    2600    1257 0       0
>> 1257    0.00    99.81   26.39   73.28   41
>> 1       12      53      11      0.33    3301    2600    1258 0       0
>> 1254    0.00    99.82   26.38
>> 1       13      26      12      0.36    3317    2600    1259 0       0
>> 1258    0.00    99.79   26.41   73.23   39
>> 1       13      54      11      0.34    3301    2600    1255 0       0
>> 1254    0.00    99.82   26.42
>> 1       14      27      12      0.36    3301    2600    1257 0       5
>> 1251    0.24    99.58   26.54   73.10   41
>> 1       14      55      12      0.36    3300    2600    1255 0       0
>> 1254    0.00    99.82   26.54
>>
>>
>> So it looks like in all tests I'm using core+sibling.
>> But a side effect of this is that:
>> 33 * 100.0 = 3300.0 MHz max turbo 28 active cores
>> 33 * 100.0 = 3300.0 MHz max turbo 24 active cores
>> 33 * 100.0 = 3300.0 MHz max turbo 20 active cores
>> 33 * 100.0 = 3300.0 MHz max turbo 14 active cores
>> 34 * 100.0 = 3400.0 MHz max turbo 12 active cores
>> 34 * 100.0 = 3400.0 MHz max turbo 8 active cores
>> 35 * 100.0 = 3500.0 MHz max turbo 4 active cores
>> 37 * 100.0 = 3700.0 MHz max turbo 2 active cores
>>
>> So more cores = less MHz per core/sibling
> Yes that is always a trade off. Also the ixgbe is limited in terms of
> PCIe bus bandwidth. The more queues you add the worse the descriptor
> overhead will be. Generally I have found that about 6 queues is ideal.
> As you start getting to more than 8, the performance for 64B packets
> will start to drop off, as each additional queue hurts the descriptor
> cache performance: the hardware writes back fewer and fewer descriptors
> per write, which increases the PCIe bus overhead for the writes.
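
A sketch of trying a smaller queue count, reusing the commands from the
setup script (6 queues pinned to 6 physical cores of node 1):

        ethtool -L enp216s0f0 combined 6
        ethtool -L enp216s0f1 combined 6
        ./set_irq_affinity.sh -x 14-19 enp216s0f0
        ./set_irq_affinity.sh -x 14-19 enp216s0f1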
>
>>>> perf top:
>>>>
>>>>     PerfTop:   77835 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz
>>>> cycles],  (all, 56 CPUs)
>>>>
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>>        16.32%  [kernel]       [k] skb_dst_force
>>>>        16.30%  [kernel]       [k] dst_release
>>>>        15.11%  [kernel]       [k] rt_cache_valid
>>>>        12.62%  [kernel]       [k] ipv4_mtu
>>> It seems a little strange that these 4 functions are at the top
>> Yes, I don't know why ipv4_mtu is called and takes so many cycles
>>
>>>>         5.60%  [kernel]       [k] do_raw_spin_lock
>>> What is calling/taking this lock? (Use perf call-graph recording.)
>> It can be hard to paste it here :)
>> attached file
>>
>>>>         3.03%  [kernel]       [k] fib_table_lookup
>>>>         2.70%  [kernel]       [k] ip_finish_output2
>>>>         2.10%  [kernel]       [k] dev_gro_receive
>>>>         1.89%  [kernel]       [k] eth_type_trans
>>>>         1.81%  [kernel]       [k] ixgbe_poll
>>>>         1.15%  [kernel]       [k] ixgbe_xmit_frame_ring
>>>>         1.06%  [kernel]       [k] __build_skb
>>>>         1.04%  [kernel]       [k] __dev_queue_xmit
>>>>         0.97%  [kernel]       [k] ip_rcv
>>>>         0.78%  [kernel]       [k] netif_skb_features
>>>>         0.74%  [kernel]       [k] ipt_do_table
>>> Unloading netfilter modules will give more performance, but it is
>>> semi-fake to do so.
>> Compiled into the kernel - only in filter mode - with ipv4+ipv6 - no other
>> modules, conntrack or otherwise.
>>
>>>>         0.70%  [kernel]       [k] acpi_processor_ffh_cstate_enter
>>>>         0.64%  [kernel]       [k] ip_forward
>>>>         0.59%  [kernel]       [k] __netif_receive_skb_core
>>>>         0.55%  [kernel]       [k] dev_hard_start_xmit
>>>>         0.53%  [kernel]       [k] ip_route_input_rcu
>>>>         0.53%  [kernel]       [k] ip_rcv_finish
>>>>         0.51%  [kernel]       [k] page_frag_free
>>>>         0.50%  [kernel]       [k] kmem_cache_alloc
>>>>         0.50%  [kernel]       [k] udp_v4_early_demux
>>>>         0.44%  [kernel]       [k] skb_release_data
>>>>         0.42%  [kernel]       [k] inet_gro_receive
>>>>         0.40%  [kernel]       [k] sch_direct_xmit
>>>>         0.39%  [kernel]       [k] __local_bh_enable_ip
>>>>         0.33%  [kernel]       [k] netdev_pick_tx
>>>>         0.33%  [kernel]       [k] validate_xmit_skb
>>>>         0.28%  [kernel]       [k] fib_validate_source
>>>>         0.27%  [kernel]       [k] deliver_ptype_list_skb
>>>>         0.25%  [kernel]       [k] eth_header
>>>>         0.23%  [kernel]       [k] get_dma_ops
>>>>         0.22%  [kernel]       [k] skb_network_protocol
>>>>         0.21%  [kernel]       [k] ip_output
>>>>         0.21%  [kernel]       [k] vlan_dev_hard_start_xmit
>>>>         0.20%  [kernel]       [k] ixgbe_alloc_rx_buffers
>>>>         0.18%  [kernel]       [k] nf_hook_slow
>>>>         0.18%  [kernel]       [k] apic_timer_interrupt
>>>>         0.18%  [kernel]       [k] virt_to_head_page
>>>>         0.18%  [kernel]       [k] build_skb
>>>>         0.16%  [kernel]       [k] swiotlb_map_page
>>>>         0.16%  [kernel]       [k] ip_finish_output
>>>>         0.16%  [kernel]       [k] udp4_gro_receive
>>>>
>>>>
>>>> RESULTS:
>>>>
>>>> CSV format - delimiter ";"
>>>>
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;1;64;1470912;88247040;1470720;85305530
>>>> 1;1;64;1470912;88285440;1470977;85335110
>>>> 2;1;64;1470464;88247040;1470402;85290508
>>>> 3;1;64;1471424;88262400;1471230;85353728
>>>> 4;1;64;1468736;88166400;1468672;85201652
>>>> 5;1;64;1470016;88181760;1469949;85234944
>>>> 6;1;64;1470720;88247040;1470466;85290624
>>>> 7;1;64;1471232;88277760;1471167;85346246
>>>> 8;1;64;1469184;88170240;1469249;85216326
>>>> 9;1;64;1470592;88227840;1470847;85294394
>>> Single core 1.47Mpps seems a little low, I would expect 2Mpps.
>>>
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;2;64;2413120;144802560;2413245;139975924
>>>> 1;2;64;2415296;144913920;2415356;140098188
>>>> 2;2;64;2416768;144898560;2416573;140105670
>>>> 3;2;64;2418176;145056000;2418110;140261806
>>>> 4;2;64;2416512;144990720;2416509;140172950
>>>> 5;2;64;2415168;144860160;2414466;140064780
>>>> 6;2;64;2416960;144983040;2416833;140190930
>>>> 7;2;64;2413632;144768000;2413568;140001734
>>>> 8;2;64;2415296;144898560;2414589;140087168
>>>> 9;2;64;2416576;144963840;2416892;140190930
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;3;64;3419008;205155840;3418882;198239244
>>>> 1;3;64;3428032;205585920;3427971;198744234
>>>> 2;3;64;3425472;205536000;3425344;198677260
>>>> 3;3;64;3425088;205470720;3425156;198603136
>>>> 4;3;64;3427648;205693440;3426883;198773888
>>>> 5;3;64;3426880;205670400;3427392;198796044
>>>> 6;3;64;3429120;205678080;3430140;198848186
>>>> 7;3;64;3422976;205355520;3423490;198458136
>>>> 8;3;64;3423168;205336320;3423486;198495372
>>>> 9;3;64;3424384;205493760;3425538;198617868
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;4;64;4406464;264364800;4405244;255560296
>>>> 1;4;64;4404672;264349440;4405122;255541504
>>>> 2;4;64;4402368;264049920;4403326;255188864
>>>> 3;4;64;4401344;264076800;4400702;255207134
>>>> 4;4;64;4385536;263074560;4386620;254312716
>>>> 5;4;64;4386560;263189760;4385404;254379532
>>>> 6;4;64;4398784;263857920;4399031;255025288
>>>> 7;4;64;4407232;264445440;4407998;255637900
>>>> 8;4;64;4413184;264698880;4413758;255875816
>>>> 9;4;64;4411328;264526080;4411906;255712372
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;5;64;5094464;305871360;5094464;295657262
>>>> 1;5;64;5090816;305514240;5091201;295274810
>>>> 2;5;64;5088384;305387520;5089792;295175108
>>>> 3;5;64;5079296;304869120;5079484;294680368
>>>> 4;5;64;5092992;305544960;5094207;295349166
>>>> 5;5;64;5092416;305502720;5093372;295334260
>>>> 6;5;64;5080896;304896000;5081090;294677004
>>>> 7;5;64;5085376;305114880;5086401;294933058
>>>> 8;5;64;5092544;305575680;5092036;295356938
>>>> 9;5;64;5093056;305652480;5093832;295449506
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;6;64;5705088;342351360;5705784;330965110
>>>> 1;6;64;5710272;342743040;5707591;331373952
>>>> 2;6;64;5703424;342182400;5701826;330776552
>>>> 3;6;64;5708736;342604800;5707963;331147462
>>>> 4;6;64;5710144;342654720;5712067;331202910
>>>> 5;6;64;5712064;342777600;5711361;331292288
>>>> 6;6;64;5710144;342585600;5708607;331144272
>>>> 7;6;64;5699840;342021120;5697853;330609222
>>>> 8;6;64;5701184;342124800;5702909;330653592
>>>> 9;6;64;5711360;342735360;5713283;331247686
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;7;64;6244416;374603520;6243591;362180072
>>>> 1;7;64;6230912;374016000;6231490;361534126
>>>> 2;7;64;6244800;374776320;6244866;362224326
>>>> 3;7;64;6238720;374376960;6238261;361838510
>>>> 4;7;64;6218816;373079040;6220413;360683962
>>>> 5;7;64;6224320;373566720;6225086;361017404
>>>> 6;7;64;6224000;373570560;6221370;360936088
>>>> 7;7;64;6210048;372741120;6210627;360212654
>>>> 8;7;64;6231616;374035200;6231537;361445502
>>>> 9;7;64;6227840;373724160;6228802;361162752
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;8;64;6251840;375144960;6251849;362609678
>>>> 1;8;64;6250816;375014400;6250881;362547038
>>>> 2;8;64;6257728;375432960;6257160;362911104
>>>> 3;8;64;6255552;375325440;6255622;362822074
>>>> 4;8;64;6243776;374576640;6243270;362120622
>>>> 5;8;64;6237184;374296320;6237690;361790080
>>>> 6;8;64;6240960;374415360;6240714;361927366
>>>> 7;8;64;6222784;373317120;6223746;360854424
>>>> 8;8;64;6225920;373593600;6227014;361154980
>>>> 9;8;64;6238528;374304000;6237701;361845238
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;14;64;6486144;389184000;6486135;376236488
>>>> 1;14;64;6454912;387390720;6454222;374466734
>>>> 2;14;64;6441152;386480640;6440431;373572780
>>>> 3;14;64;6450240;386972160;6450870;374070014
>>>> 4;14;64;6465600;387997440;6467221;375089654
>>>> 5;14;64;6448384;386860800;6448000;373980230
>>>> 6;14;64;6452352;387095040;6452148;374168904
>>>> 7;14;64;6441984;386507520;6443203;373665058
>>>> 8;14;64;6456704;387340800;6455744;374429092
>>>> 9;14;64;6464640;387901440;6465218;374949004
>>>> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
>>>> 0;16;64;6939008;416325120;6938696;402411192
>>>> 1;16;64;6941952;416444160;6941745;402558918
>>>> 2;16;64;6960576;417584640;6960707;403698718
>>>> 3;16;64;6940736;416486400;6941820;402503876
>>>> 4;16;64;6927680;415741440;6927420;401853870
>>>> 5;16;64;6929792;415687680;6929917;401839196
>>>> 6;16;64;6950400;416989440;6950661;403026166
>>>> 7;16;64;6953664;417216000;6953454;403260544
>>>> 8;16;64;6948480;416851200;6948800;403023266
>>>> 9;16;64;6924160;415422720;6924092;401542468
>>> I've seen Linux scale beyond 6.9Mpps, thus I also see this as an
>>> issue/bug.  You could be stalling on DMA TX completion being too slow,
>>> but you already increased the interval and increased the TX ring queue
>>> size.  You could play with those settings and see if it changes this?
>>>
>>> Could you try my napi_monitor tool in:
>>>
>>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf
>>>
>>> Also provide the output from:
>>>    mpstat -P ALL -u -I SCPU -I SUM 2
>> with 16 cores / 16 RSS queues
>> Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft %steal
>> %guest  %gnice   %idle
>> Average:     all    0.00    0.00    0.01    0.00    0.00   28.57 0.00
>> 0.00    0.00   71.42
>> Average:       0    0.00    0.00    0.04    0.00    0.00    0.08 0.00
>> 0.00    0.00   99.88
>> Average:       1    0.00    0.00    0.12    0.00    0.00    0.00 0.00
>> 0.00    0.00   99.88
>> Average:       2    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       3    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       4    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       5    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       6    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       7    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       8    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:       9    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      10    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      11    0.08    0.00    0.04    0.00    0.00    0.00 0.00
>> 0.00    0.00   99.88
>> Average:      12    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      13    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      14    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      15    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      16    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      17    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      18    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      19    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      20    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      21    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      22    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      23    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      24    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      25    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      26    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      27    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      28    0.00    0.00    0.04    0.00    0.00    0.00 0.00
>> 0.00    0.00   99.96
>> Average:      29    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      30    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      31    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      32    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      33    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      34    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      35    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      36    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      37    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      38    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      39    0.04    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00   99.96
>> Average:      40    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      41    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      42    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      43    0.00    0.00    0.00    0.00    0.00  100.00 0.00
>> 0.00    0.00    0.00
>> Average:      44    0.00    0.00    0.04    0.17    0.00    0.00 0.00
>> 0.00    0.00   99.79
>> Average:      45    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      46    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      47    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      48    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      49    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      50    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      51    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      52    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      53    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      54    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>> Average:      55    0.00    0.00    0.00    0.00    0.00    0.00 0.00
>> 0.00    0.00  100.00
>>
>> Average:     CPU    intr/s
>> Average:     all 123596.08
>> Average:       0    646.38
>> Average:       1    500.54
>> Average:       2    511.67
>> Average:       3    534.25
>> Average:       4    542.21
>> Average:       5    531.54
>> Average:       6    554.58
>> Average:       7    535.88
>> Average:       8    544.58
>> Average:       9    536.42
>> Average:      10    575.46
>> Average:      11    601.12
>> Average:      12    502.08
>> Average:      13    575.46
>> Average:      14   5917.92
>> Average:      15   5949.58
>> Average:      16   7021.29
>> Average:      17   7299.71
>> Average:      18   7391.67
>> Average:      19   7354.25
>> Average:      20   7543.42
>> Average:      21   7354.25
>> Average:      22   7322.33
>> Average:      23   7368.71
>> Average:      24   7429.00
>> Average:      25   7406.46
>> Average:      26   7400.67
>> Average:      27   7447.21
>> Average:      28    517.00
>> Average:      29    549.54
>> Average:      30    529.33
>> Average:      31    533.83
>> Average:      32    541.25
>> Average:      33    541.17
>> Average:      34    532.50
>> Average:      35    545.17
>> Average:      36    528.96
>> Average:      37    509.92
>> Average:      38    520.12
>> Average:      39    523.29
>> Average:      40    530.75
>> Average:      41    542.33
>> Average:      42   5921.71
>> Average:      43   5949.42
>> Average:      44    503.04
>> Average:      45    542.75
>> Average:      46    582.50
>> Average:      47    581.71
>> Average:      48    495.29
>> Average:      49    524.38
>> Average:      50    527.92
>> Average:      51    528.12
>> Average:      52    456.38
>> Average:      53    477.00
>> Average:      54    440.92
>> Average:      55    568.83
>>
>> Average:     CPU       HI/s    TIMER/s   NET_TX/s   NET_RX/s BLOCK/s
>> IRQ_POLL/s  TASKLET/s    SCHED/s  HRTIMER/s      RCU/s
>> Average:       0       0.00     250.00       0.17      87.00 0.00       0.00
>> 45.46     250.00       0.00      13.75
>> Average:       1       0.00     233.42       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      17.21
>> Average:       2       0.00     249.04       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      12.67
>> Average:       3       0.00     249.92       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      34.42
>> Average:       4       0.00     248.67       0.17       0.00 0.00       0.00
>> 0.00     249.96       0.00      43.42
>> Average:       5       0.00     249.46       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      32.17
>> Average:       6       0.00     249.79       0.00       0.00 0.00       0.00
>> 0.00     249.87       0.00      54.92
>> Average:       7       0.00     240.12       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      45.79
>> Average:       8       0.00     247.42       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      47.25
>> Average:       9       0.00     249.29       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      37.17
>> Average:      10       0.00     248.75       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      76.79
>> Average:      11       0.00     249.29       0.00       0.00 0.00       0.00
>> 42.79     249.83       0.00      59.21
>> Average:      12       0.00     249.83       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00       2.29
>> Average:      13       0.00     249.92       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      75.62
>> Average:      14       0.00     148.21       0.17    5758.04 0.00       0.00
>> 0.00       8.42       0.00       3.08
>> Average:      15       0.00     148.42       0.46    5789.25 0.00       0.00
>> 0.00       8.33       0.00       3.12
>> Average:      16       0.00     142.62       0.79    6866.46 0.00       0.00
>> 0.00       8.29       0.00       3.12
>> Average:      17       0.00     143.17       0.42    7145.00 0.00       0.00
>> 0.00       8.08       0.00       3.04
>> Average:      18       0.00     153.62       0.42    7226.42 0.00       0.00
>> 0.00       8.04       0.00       3.17
>> Average:      19       0.00     150.46       0.46    7192.21 0.00       0.00
>> 0.00       8.04       0.00       3.08
>> Average:      20       0.00     145.21       0.17    7386.50 0.00       0.00
>> 0.00       8.29       0.00       3.25
>> Average:      21       0.00     150.96       0.46    7191.37 0.00       0.00
>> 0.00       8.25       0.00       3.21
>> Average:      22       0.00     146.67       0.54    7163.96 0.00       0.00
>> 0.00       8.04       0.00       3.12
>> Average:      23       0.00     151.38       0.42    7205.75 0.00       0.00
>> 0.00       8.00       0.00       3.17
>> Average:      24       0.00     153.33       0.17    7264.12 0.00       0.00
>> 0.00       8.08       0.00       3.29
>> Average:      25       0.00     153.21       0.17    7241.83 0.00       0.00
>> 0.00       7.96       0.00       3.29
>> Average:      26       0.00     153.96       0.17    7234.88 0.00       0.00
>> 0.00       8.38       0.00       3.29
>> Average:      27       0.00     151.71       0.79    7283.25 0.00       0.00
>> 0.00       8.04       0.00       3.42
>> Average:      28       0.00     245.71       0.00       0.00 0.00       0.00
>> 0.00     249.50       0.00      21.79
>> Average:      29       0.00     233.21       0.00       0.00 0.00       0.00
>> 0.00     249.87       0.00      66.46
>> Average:      30       0.00     248.92       0.00       0.00 0.00       0.00
>> 0.00     250.00       0.00      30.42
>> Average:      31       0.00     249.92       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      33.96
>> Average:      32       0.00     248.67       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      42.62
>> Average:      33       0.00     249.46       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      41.79
>> Average:      34       0.00     249.79       0.00       0.00 0.00       0.00
>> 0.00     249.87       0.00      32.83
>> Average:      35       0.00     240.12       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      55.08
>> Average:      36       0.00     247.42       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      31.58
>> Average:      37       0.00     249.29       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      10.71
>> Average:      38       0.00     248.75       0.00       0.00 0.00       0.00
>> 0.00     249.87       0.00      21.50
>> Average:      39       0.00     249.50       0.00       0.00 0.00       0.00
>> 0.00     249.83       0.00      23.96
>> Average:      40       0.00     249.83       0.00       0.00 0.00       0.00
>> 0.00     249.96       0.00      30.96
>> Average:      41       0.00     249.92       0.00       0.00 0.00       0.00
>> 0.00     249.92       0.00      42.50
>> Average:      42       0.00     148.38       0.71    5761.00 0.00       0.00
>> 0.00       8.25       0.00       3.38
>> Average:      43       0.00     147.21       0.50    5790.33 0.00       0.00
>> 0.00       8.00       0.00       3.38
>> Average:      44       0.00     248.96       0.00       0.00 0.00       0.00
>> 0.00     248.13       0.00       5.96
>> Average:      45       0.00     249.04       0.00       0.00 0.00       0.00
>> 0.00     248.88       0.00      44.83
>> Average:      46       0.00     248.96       0.00       0.00 0.00       0.00
>> 0.00     248.58       0.00      84.96
>> Average:      47       0.00     249.00       0.00       0.00 0.00       0.00
>> 0.00     248.75       0.00      83.96
>> Average:      48       0.00     249.12       0.00       0.00 0.00       0.00
>> 0.00     132.83       0.00     113.33
>> Average:      49       0.00     249.12       0.00       0.00 0.00       0.00
>> 0.00     248.62       0.00      26.62
>> Average:      50       0.00     248.92       0.00       0.00 0.00       0.00
>> 0.00     248.58       0.00      30.42
>> Average:      51       0.00     249.08       0.00       0.00 0.00       0.00
>> 0.00     248.42       0.00      30.63
>> Average:      52       0.00     249.21       0.00       0.00 0.00       0.00
>> 0.00     131.96       0.00      75.21
>> Average:      53       0.00     249.08       0.00       0.00 0.00       0.00
>> 0.00     136.12       0.00      91.79
>> Average:      54       0.00     249.00       0.00       0.00 0.00       0.00
>> 0.00     136.79       0.00      55.12
>> Average:      55       0.00     249.04       0.00       0.00 0.00       0.00
>> 0.00     248.71       0.00      71.08
>>
>>
>>
>
