Message-ID: <20170815112307.2dd366fe@redhat.com>
Date: Tue, 15 Aug 2017 11:23:07 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Paweł Staszewski <pstaszewski@...are.pl>
Cc: brouer@...hat.com,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Alexander Duyck <alexander.duyck@...il.com>,
Saeed Mahameed <saeedm@...lanox.com>,
Tariq Toukan <tariqt@...lanox.com>
Subject: Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding
performance vs Core/RSS number / HT on
On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski <pstaszewski@...are.pl> wrote:
> On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:
> > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski <pstaszewski@...are.pl> wrote:
> >
> >> To show the difference, below is a comparison of vlan/no-vlan traffic
> >>
> >> 10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan
> > I'm trying to reproduce in my testlab (with ixgbe). I do see a
> > performance reduction of about 10-19% when I forward out a VLAN
> > interface. This is larger than I expected, but still lower than the
> > 30-40% slowdown you reported.
> >
> > [...]
> Ok, the Mellanox arrived (MT27700 - mlx5 driver)
> And to compare Mellanox with vlans and without: 33% performance
> degradation (less than with ixgbe, where I reach ~40% with the same settings)
>
> Mellanox without TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;11089305;709715520;8871553;567779392
> 1;16;64;11096292;710162688;11095566;710116224
> 2;16;64;11095770;710129280;11096799;710195136
> 3;16;64;11097199;710220736;11097702;710252928
> 4;16;64;11080984;567081856;11079662;709098368
> 5;16;64;11077696;708972544;11077039;708930496
> 6;16;64;11082991;709311424;8864802;567347328
> 7;16;64;11089596;709734144;8870927;709789184
> 8;16;64;11094043;710018752;11095391;710105024
>
> Mellanox with TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;7369914;471674496;7370281;471697980
> 1;16;64;7368896;471609408;7368043;471554752
> 2;16;64;7367577;471524864;7367759;471536576
> 3;16;64;7368744;377305344;7369391;471641024
> 4;16;64;7366824;471476736;7364330;471237120
> 5;16;64;7368352;471574528;7367239;471503296
> 6;16;64;7367459;471517376;7367806;471539584
> 7;16;64;7367190;471500160;7367988;471551232
> 8;16;64;7368023;471553472;7368076;471556864
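As a rough cross-check of the 33% figure, the steady-state rows of the two tables above can be compared directly. A minimal Python sketch, using row ID 1 of each table; the helper name pct_drop is only illustrative:

```python
def pct_drop(baseline_pps: int, degraded_pps: int) -> float:
    """Relative throughput loss in percent."""
    return 100.0 * (baseline_pps - degraded_pps) / baseline_pps

# PPS_TX from row ID 1 of each table above.
no_vlan_tx = 11095566
vlan_tx = 7368043

drop = pct_drop(no_vlan_tx, vlan_tx)
print(f"TX on VLAN costs about {drop:.1f}% of forwarded pps")

# Sanity check on the units: in the same no-vlan row, BPS_RX / PPS_RX
# gives the per-packet size, which matches the PKT_SIZE column (64).
rx_pps, rx_bps = 11096292, 710162688
bytes_per_pkt = rx_bps / rx_pps
print(f"BPS_RX / PPS_RX = {bytes_per_pkt:.0f} bytes per packet")
```

Since BPS divided by PPS comes out at the 64-byte packet size in that row, the BPS columns appear to count bytes (not bits) per second.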
I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top). The Mellanox drivers
fortunately have a stats counter to tell us this explicitly (which the
ixgbe driver doesn't).
You can use my ethtool_stats.pl script to watch these stats:
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency: dnf install perl-Time-HiRes)
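If perl is not at hand, the same idea (sample `ethtool -S` periodically and print per-second deltas) can be sketched in a few lines of Python. This is not the ethtool_stats.pl script, only a rough stand-in. The "cache" substring filter is an assumption about how the mlx5 page-recycle counters are named; check the actual names in your own `ethtool -S` output.

```python
import subprocess
import time

def parse_ethtool_stats(text):
    """Parse `ethtool -S <dev>` output into {counter_name: value}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats

def read_stats(dev):
    """Run `ethtool -S` for one device and parse the result."""
    out = subprocess.run(["ethtool", "-S", dev], check=True,
                         capture_output=True, text=True).stdout
    return parse_ethtool_stats(out)

def rates(before, after, interval, match="cache"):
    """Per-second deltas for counters whose name contains `match`.

    The default filter "cache" is an assumption about counter naming.
    Counters that did not change are left out.
    """
    return {name: (after[name] - before[name]) / interval
            for name in after
            if match in name and name in before
            and after[name] != before[name]}

def watch(dev, interval=1.0):
    """Print changing counters once per interval until interrupted."""
    prev = read_stats(dev)
    while True:
        time.sleep(interval)
        cur = read_stats(dev)
        for name, rate in sorted(rates(prev, cur, interval).items()):
            print(f"{dev} {name}: {rate:,.0f}/sec")
        prev = cur
```

Usage would be something like watch("enp175s0f0"), run once with TX on the VLAN and once without, to see whether the recycle-related counters behave differently.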
> ethtool settings for both tests:
> ifc='enp175s0f0 enp175s0f1'
> for i in $ifc
> do
> ip link set up dev $i
> ethtool -A $i autoneg off rx off tx off
> ethtool -G $i rx 128 tx 256
The ring queue size recommendations might be different for the mlx5
driver (Cc'ing Mellanox maintainers).
> ip link set $i txqueuelen 1000
> ethtool -C $i rx-usecs 25
> ethtool -L $i combined 16
> ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off
> tx-nocache-copy off ntuple on
> ethtool -N $i rx-flow-hash udp4 sdfn
> done
Thanks for being explicit about what your setup is :-)
> and perf top:
> PerfTop: 83650 irqs/sec kernel:99.7% exact: 0.0% [4000Hz
> cycles], (all, 56 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 14.25% [kernel] [k] dst_release
> 14.17% [kernel] [k] skb_dst_force
> 13.41% [kernel] [k] rt_cache_valid
> 11.47% [kernel] [k] ip_finish_output2
> 7.01% [kernel] [k] do_raw_spin_lock
> 5.07% [kernel] [k] page_frag_free
> 3.47% [mlx5_core] [k] mlx5e_xmit
> 2.88% [kernel] [k] fib_table_lookup
> 2.43% [mlx5_core] [k] skb_from_cqe.isra.32
> 1.97% [kernel] [k] virt_to_head_page
> 1.81% [mlx5_core] [k] mlx5e_poll_tx_cq
> 0.93% [kernel] [k] __dev_queue_xmit
> 0.87% [kernel] [k] __build_skb
> 0.84% [kernel] [k] ipt_do_table
> 0.79% [kernel] [k] ip_rcv
> 0.79% [kernel] [k] acpi_processor_ffh_cstate_enter
> 0.78% [kernel] [k] netif_skb_features
> 0.73% [kernel] [k] __netif_receive_skb_core
> 0.52% [kernel] [k] dev_hard_start_xmit
> 0.52% [kernel] [k] build_skb
> 0.51% [kernel] [k] ip_route_input_rcu
> 0.50% [kernel] [k] skb_unref
> 0.49% [kernel] [k] ip_forward
> 0.48% [mlx5_core] [k] mlx5_cqwq_get_cqe
> 0.44% [kernel] [k] udp_v4_early_demux
> 0.41% [kernel] [k] napi_consume_skb
> 0.40% [kernel] [k] __local_bh_enable_ip
> 0.39% [kernel] [k] ip_rcv_finish
> 0.39% [kernel] [k] kmem_cache_alloc
> 0.38% [kernel] [k] sch_direct_xmit
> 0.33% [kernel] [k] validate_xmit_skb
> 0.32% [mlx5_core] [k] mlx5e_free_rx_wqe_reuse
> 0.29% [kernel] [k] netdev_pick_tx
> 0.28% [mlx5_core] [k] mlx5e_build_rx_skb
> 0.27% [kernel] [k] deliver_ptype_list_skb
> 0.26% [kernel] [k] fib_validate_source
> 0.26% [mlx5_core] [k] mlx5e_napi_poll
> 0.26% [mlx5_core] [k] mlx5e_handle_rx_cqe
> 0.26% [mlx5_core] [k] mlx5e_rx_cache_get
> 0.25% [kernel] [k] eth_header
> 0.23% [kernel] [k] skb_network_protocol
> 0.20% [kernel] [k] nf_hook_slow
> 0.20% [kernel] [k] vlan_passthru_hard_header
> 0.20% [kernel] [k] vlan_dev_hard_start_xmit
> 0.19% [kernel] [k] swiotlb_map_page
> 0.18% [kernel] [k] compound_head
> 0.18% [kernel] [k] neigh_connected_output
> 0.18% [mlx5_core] [k] mlx5e_alloc_rx_wqe
> 0.18% [kernel] [k] ip_output
> 0.17% [kernel] [k] prefetch_freepointer.isra.70
> 0.17% [kernel] [k] __slab_free
> 0.16% [kernel] [k] eth_type_vlan
> 0.16% [kernel] [k] ip_finish_output
> 0.15% [kernel] [k] kmem_cache_free_bulk
> 0.14% [kernel] [k] netif_receive_skb_internal
>
>
>
>
> wondering why this:
> 1.97% [kernel] [k] virt_to_head_page
> is in top...
This is related to the page_frag_free() call, but it is weird that it
shows up because it is supposed to be inlined (it is explicitly marked
inline in include/linux/mm.h).
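To put numbers on how much of the profile these symbols account for, the top lines of the perf top output quoted above can be summed per group. A small Python sketch; which symbols go into which group is my own choice, not something perf reports:

```python
import re

# Matches lines such as "> 14.25% [kernel] [k] dst_release".
PERF_LINE = re.compile(r"^\s*(\d+\.\d+)%\s+\[(\S+)\]\s+\[k\]\s+(\S+)")

def parse_perf_top(text):
    """Return {symbol: percent} from perf top lines (quote marks allowed)."""
    shares = {}
    for line in text.splitlines():
        m = PERF_LINE.match(line.lstrip("> "))
        if m:
            shares[m.group(3)] = float(m.group(1))
    return shares

def group_share(shares, symbols):
    """Sum the percentages of the given symbols (missing ones count as 0)."""
    return sum(shares.get(sym, 0.0) for sym in symbols)

# First lines of the perf top output quoted above.
sample = """
> 14.25% [kernel] [k] dst_release
> 14.17% [kernel] [k] skb_dst_force
> 13.41% [kernel] [k] rt_cache_valid
> 11.47% [kernel] [k] ip_finish_output2
> 7.01% [kernel] [k] do_raw_spin_lock
> 5.07% [kernel] [k] page_frag_free
> 1.97% [kernel] [k] virt_to_head_page
"""

shares = parse_perf_top(sample)
dst_group = group_share(shares, ["dst_release", "skb_dst_force", "rt_cache_valid"])
page_group = group_share(shares, ["page_frag_free", "virt_to_head_page"])
print(f"dst/route-cache symbols: {dst_group:.2f}%")
print(f"page freeing symbols:    {page_group:.2f}%")
```

The same parser works on output saved to a file, which makes it easy to compare the VLAN and no-vlan runs side by side.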
> >>>>> perf top:
> >>>>>
> >>>>> PerfTop: 77835 irqs/sec kernel:99.7%
> >>>>> ---------------------------------------------
> >>>>>
> >>>>> 16.32% [kernel] [k] skb_dst_force
> >>>>> 16.30% [kernel] [k] dst_release
> >>>>> 15.11% [kernel] [k] rt_cache_valid
> >>>>> 12.62% [kernel] [k] ipv4_mtu
> >>>> It seems a little strange that these 4 functions are on the top
> > I don't see these in my test.
> >
> >>>>
> >>>>> 5.60% [kernel] [k] do_raw_spin_lock
> >>>> Who is calling/taking this lock? (Use perf call-graph recording).
> >>> can be hard to paste it here:)
> >>> attached file
> > The attachment was very big. Please don't attach such big files on mailing
> > lists. Next time please share them via e.g. pastebin. The output was a
> > capture from your terminal, which made it more difficult to
> > read. Hint: You can/could use perf's --stdio option and place the output
> > in a file instead.
> >
> > The output (extracted below) didn't show who called 'do_raw_spin_lock',
> > BUT it showed another interesting thing. The kernel code in
> > __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> > as it will first call skb_dst_force() and then skb_dst_drop() when the
> > packet is transmitted on a VLAN.
> >
> > static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> > {
> > [...]
> >         /* If device/qdisc don't need skb->dst, release it right now while
> >          * its hot in this cpu cache.
> >          */
> >         if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> >                 skb_dst_drop(skb);
> >         else
> >                 skb_dst_force(skb);
> >
> >
> >
> > Extracted part of attached perf output:
> >
> > --5.37%--ip_rcv_finish
> > |
> > |--4.02%--ip_forward
> > | |
> > | --3.92%--ip_forward_finish
> > | |
> > | --3.91%--ip_output
> > | |
> > | --3.90%--ip_finish_output
> > | |
> > | --3.88%--ip_finish_output2
> > | |
> > | --2.77%--neigh_connected_output
> > | |
> > | --2.74%--dev_queue_xmit
> > | |
> > | --2.73%--__dev_queue_xmit
> > | |
> > | |--1.66%--dev_hard_start_xmit
> > | | |
> > | | --1.64%--vlan_dev_hard_start_xmit
> > | | |
> > | | --1.63%--dev_queue_xmit
> > | | |
> > | | --1.62%--__dev_queue_xmit
> > | | |
> > | | |--0.99%--skb_dst_drop.isra.77
> > | | | |
> > | | | --0.99%--dst_release
> > | | |
> > | | --0.55%--sch_direct_xmit
> > | |
> > | --0.99%--skb_dst_force
> > |
> > --1.29%--ip_route_input_noref
> > |
> > --1.29%--ip_route_input_rcu
> > |
> > --1.05%--rt_cache_valid
> >
>
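The force-then-drop sequence in the call graph above can be illustrated with a toy model. This is a Python sketch of the idea only, not kernel code. It assumes (1) the forwarding path attaches the dst without taking a reference, as ip_route_input_noref in the call graph suggests, and (2) the VLAN device does not have IFF_XMIT_DST_RELEASE set while the physical device does. The flag value and class layout are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Optional

IFF_XMIT_DST_RELEASE = 1  # toy value, not the kernel's bit

@dataclass
class Dst:
    refcnt: int = 1
    atomic_ops: int = 0  # writes to the shared refcount

    def hold(self):
        self.refcnt += 1
        self.atomic_ops += 1

    def release(self):
        self.refcnt -= 1
        self.atomic_ops += 1

@dataclass
class Skb:
    dst: Optional[Dst] = None
    refcounted: bool = False  # False models a "noref" dst

def skb_dst_force(skb):
    """Turn a noref dst into a refcounted one (costs one refcount write)."""
    if skb.dst is not None and not skb.refcounted:
        skb.dst.hold()
        skb.refcounted = True

def skb_dst_drop(skb):
    """Detach the dst; only a refcounted dst costs a refcount write."""
    if skb.dst is not None:
        if skb.refcounted:
            skb.dst.release()
        skb.dst = None
        skb.refcounted = False

def dev_queue_xmit(skb, priv_flags):
    """Mirrors the if/else quoted from __dev_queue_xmit() above."""
    if priv_flags & IFF_XMIT_DST_RELEASE:
        skb_dst_drop(skb)
    else:
        skb_dst_force(skb)

def forward(dst, npackets, via_vlan):
    """Forward packets sharing one dst; return the refcount writes incurred."""
    for _ in range(npackets):
        skb = Skb(dst=dst)
        if via_vlan:
            dev_queue_xmit(skb, 0)                 # VLAN device (assumed: flag clear)
        dev_queue_xmit(skb, IFF_XMIT_DST_RELEASE)  # physical device
    return dst.atomic_ops

print("no-vlan refcount writes:", forward(Dst(), 1000, via_vlan=False))
print("vlan refcount writes:   ", forward(Dst(), 1000, via_vlan=True))
```

In this model the no-vlan path never touches the dst refcount, while the VLAN path pays one increment and one decrement per packet on a dst shared by all CPUs. If the real code behaves like this, it would fit with dst_release and skb_dst_force being at the top of perf top.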
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer