Message-ID: <20181110230630.0daeba8e@redhat.com>
Date: Sat, 10 Nov 2018 23:06:30 +0100
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Paweł Staszewski <pstaszewski@...are.pl>
Cc: Saeed Mahameed <saeedm@...lanox.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
brouer@...hat.com
Subject: Re: Kernel 4.19 network performance - forwarding/routing normal
users traffic
On Sat, 10 Nov 2018 20:56:02 +0100
Paweł Staszewski <pstaszewski@...are.pl> wrote:
> On 10.11.2018 at 20:49, Paweł Staszewski wrote:
> >
> >
> > On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
> >> On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski
> >> <pstaszewski@...are.pl> wrote:
> >>
> >>> On 08.11.2018 at 20:12, Paweł Staszewski wrote:
> >>>> CPU load is lower than for connectx4 - but it looks like bandwidth
> >>>> limit is the same :)
> >>>> But also after reaching 60Gbit/60Gbit
> >>>>
> >>>> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>>> input: /proc/net/dev type: rate
> >>>> - iface Rx Tx Total
> >>>> ===================================================================
> >>>>
> >>>>
> >>>> enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s
> >>>> enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s
> >>>> -------------------------------------------------------------------
> >>>>
> >>>>
> >>>> total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s
> >>> Today reached 65/65Gbit/s
> >>>
> >>> But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets
> >>> (with 50%CPU on all 28cores) - so still there is cpu power to use :).
> >> This is weird!
> >>
> >> How do you see / measure these drops?
> >
> > Simple icmp test like ping -i 0.1
> > And I'm testing by icmp the management ip address on a vlan that is attached
> > to one NIC (the side that is more stressed with RX)
> > And another icmp test is forwarded through this router to a host behind it
> >
> > Both measurements show the same loss ratio, from 0.1 to 0.5%, after
> > reaching ~45Gbit/s on the RX side. Depending on how hard the RX side is
> > pushed, drops vary between 0.1 and 0.5%, even 0.6% :)
> >
Okay good to know, you use an external measurement for this. I do
think packets are getting dropped by the NIC.
> >>> So checked other stats.
> >>> softnet_stats shows average 1k squeezed per sec:
> >> Is the output below the raw counters, not per sec?
> >>
> >> It would be valuable to see the per sec stats instead...
> >> I use this tool:
> >> https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
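For readers who want to see what "per sec stats" means here, below is a minimal sketch of the idea, not the softnet_stat.pl tool itself. It assumes the /proc/net/softnet_stat layout where each line is one CPU and the first three hex columns are processed, dropped and time_squeeze. The function names and the two sample snapshots are made up for illustration.

```python
# Sketch: turn two raw /proc/net/softnet_stat snapshots into per-second rates.
# Assumption: one line per CPU, hex fields, first three columns are
# processed, dropped, time_squeeze. Sample data below is synthetic.

def parse_softnet(text):
    """Return one (processed, dropped, squeezed) tuple per CPU line."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append(tuple(int(f, 16) for f in fields[:3]))
    return rows


def per_second(before, after, interval_s):
    """Per-CPU counter deltas divided by the sampling interval.

    The counters are printed as 32-bit values, so the delta is taken
    modulo 2**32 to survive a wrap between the two snapshots.
    """
    rates = []
    for row_before, row_after in zip(before, after):
        rates.append(tuple(((a - b) % 2**32) / interval_s
                           for b, a in zip(row_before, row_after)))
    return rates


# Synthetic snapshots taken 2 seconds apart (two CPUs, three columns shown).
snap_t0 = """
00000010 00000000 00000001
000000ff 00000000 00000000
"""
snap_t1 = """
00000030 00000000 00000003
000001ff 00000000 00000000
"""

rates = per_second(parse_softnet(snap_t0), parse_softnet(snap_t1), 2.0)
for cpu, (total, dropped, squeezed) in enumerate(rates):
    print(f"CPU:{cpu:02d} {total:10.0f} {dropped:8.0f} {squeezed:8.0f}")
```

To use it on a live box, read /proc/net/softnet_stat twice with a known sleep in between and pass the two strings in place of the synthetic snapshots.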
> CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec
> CPU:00 0 0 0 0 0 0
[...]
> CPU:13 0 0 0 0 0 0
> CPU:14 485538 0 43 0 0 0
> CPU:15 474794 0 51 0 0 0
> CPU:16 449322 0 41 0 0 0
> CPU:17 476420 0 46 0 0 0
> CPU:18 440436 0 38 0 0 0
> CPU:19 501499 0 49 0 0 0
> CPU:20 459468 0 49 0 0 0
> CPU:21 438928 0 47 0 0 0
> CPU:22 468983 0 40 0 0 0
> CPU:23 446253 0 47 0 0 0
> CPU:24 451909 0 46 0 0 0
> CPU:25 479373 0 55 0 0 0
> CPU:26 467848 0 49 0 0 0
> CPU:27 453153 0 51 0 0 0
> CPU:28 0 0 0 0 0 0
[...]
> CPU:40 0 0 0 0 0 0
> CPU:41 0 0 0 0 0 0
> CPU:42 466853 0 43 0 0 0
> CPU:43 453059 0 54 0 0 0
> CPU:44 363219 0 34 0 0 0
> CPU:45 353632 0 38 0 0 0
> CPU:46 371618 0 40 0 0 0
> CPU:47 350518 0 46 0 0 0
> CPU:48 397544 0 40 0 0 0
> CPU:49 364873 0 38 0 0 0
> CPU:50 383630 0 38 0 0 0
> CPU:51 358771 0 39 0 0 0
> CPU:52 372547 0 38 0 0 0
> CPU:53 372882 0 36 0 0 0
> CPU:54 366244 0 43 0 0 0
> CPU:55 365886 0 39 0 0 0
>
> Summed: 11835201 0 1217 0 0 0
Note that the per-CPU squeeze rate is not too large.
The summed 11.8 Mpps is a little high compared to:
Ethtool(enp216s0) stat: 4971677 (4,971,677) <= rx_packets /sec
Ethtool(enp175s0) stat: 3717148 (3,717,148) <= rx_packets /sec
Sum: 3717148+4971677 = 8688825 (8,688,825)
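The comparison above can be redone as a small sketch, using only the numbers quoted in this mail. The variable names are mine, and the code only quantifies the gap between the two sources; it does not explain where the gap comes from.

```python
# Sketch: compare the softnet_stat summed rate with the ethtool rx_packets
# rates quoted in this mail. All numbers are copied from the outputs above.

softnet_total_pps = 11835201          # "Summed:" line from softnet_stat.pl

rx_packets_pps = {
    "enp216s0": 4971677,              # ethtool rx_packets /sec
    "enp175s0": 3717148,              # ethtool rx_packets /sec
}

ethtool_total_pps = sum(rx_packets_pps.values())
excess_pps = softnet_total_pps - ethtool_total_pps
ratio = softnet_total_pps / ethtool_total_pps

print(f"ethtool sum : {ethtool_total_pps:,} pps")
print(f"softnet sum : {softnet_total_pps:,} pps")
print(f"excess      : {excess_pps:,} pps (ratio {ratio:.2f})")
```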
[...]
> >>>
> >>> Remember those tests are now on two separate connectx5 connected to
> >>> two separate pcie x16 gen 3.0
> >> That is strange... I still suspect some HW NIC issue, can you provide
> >> ethtool stats info via tool:
> >>
> >> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >>
> >> $ ethtool_stats.pl --dev enp175s0 --dev enp216s0
> >>
> >> The tool removes zero-stats counters and reports per sec stats. It makes
> >> it easier to spot what is relevant for the given workload.
> > yes mlnx have just too many counters that are always 0 for my case :)
> > Will try this also
> >
> But still a lot of non-zero counters
> Show adapter(s) (enp175s0 enp216s0) statistics (ONLY that changed!)
> Ethtool(enp175s0) stat: 8891 ( 8,891) <= ch0_arm /sec
[...]
I have copied the stats into another document so I can look at them
more closely... and I've found some interesting stats.
E.g. we can see that the NIC hardware is dropping packets.
RX-drops on enp175s0:
(enp175s0) stat: 4850734036 ( 4,850,734,036) <= rx_bytes /sec
(enp175s0) stat: 5069043007 ( 5,069,043,007) <= rx_bytes_phy /sec
-218308971 ( -218,308,971) Dropped bytes /sec
(enp175s0) stat: 139602 ( 139,602) <= rx_discards_phy /sec
(enp175s0) stat: 3717148 ( 3,717,148) <= rx_packets /sec
(enp175s0) stat: 3862420 ( 3,862,420) <= rx_packets_phy /sec
-145272 ( -145,272) Dropped packets /sec
RX-drops on enp216s0 are lower:
(enp216s0) stat: 2592286809 ( 2,592,286,809) <= rx_bytes /sec
(enp216s0) stat: 2633575771 ( 2,633,575,771) <= rx_bytes_phy /sec
-41288962 ( -41,288,962) Dropped bytes /sec
(enp216s0) stat: 464 ( 464) <= rx_discards_phy /sec
(enp216s0) stat: 4971677 ( 4,971,677) <= rx_packets /sec
(enp216s0) stat: 4975563 ( 4,975,563) <= rx_packets_phy /sec
-3886 ( -3,886) Dropped packets /sec
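The "Dropped" lines above are simply the *_phy counter minus the matching non-phy counter. A minimal sketch of that arithmetic, using the per-sec values quoted above, follows. The dict layout and the drop_summary helper are my own names for illustration, and the percentage is taken relative to rx_packets_phy, which is my choice of denominator.

```python
# Sketch: derive the "Dropped" figures from the ethtool per-sec counters.
# Numbers are copied from the stats above; names are illustrative only.

stats = {
    "enp175s0": {
        "rx_bytes": 4850734036, "rx_bytes_phy": 5069043007,
        "rx_packets": 3717148, "rx_packets_phy": 3862420,
        "rx_discards_phy": 139602,
    },
    "enp216s0": {
        "rx_bytes": 2592286809, "rx_bytes_phy": 2633575771,
        "rx_packets": 4971677, "rx_packets_phy": 4975563,
        "rx_discards_phy": 464,
    },
}


def drop_summary(s):
    """Difference between what the PHY saw and what reached rx_packets."""
    dropped_pkts = s["rx_packets_phy"] - s["rx_packets"]
    dropped_bytes = s["rx_bytes_phy"] - s["rx_bytes"]
    drop_pct = 100.0 * dropped_pkts / s["rx_packets_phy"]
    return dropped_pkts, dropped_bytes, drop_pct


for dev, s in stats.items():
    pkts, nbytes, pct = drop_summary(s)
    print(f"{dev}: {pkts:,} pkts/sec, {nbytes:,} bytes/sec dropped "
          f"({pct:.2f}% of rx_packets_phy), "
          f"rx_discards_phy {s['rx_discards_phy']:,}/sec")
```

Expressed this way, enp175s0 loses a few percent of what arrives at the PHY, while enp216s0 loses well under one percent.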
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer