Message-ID: <20201104015834.mcn2eoibxf6j3ksw@skbuf>
Date: Wed, 4 Nov 2020 03:58:34 +0200
From: Vladimir Oltean <olteanv@...il.com>
To: netdev@...r.kernel.org
Subject: DSA and ptp_classify_raw: saving some CPU cycles causes worse
throughput?
Hi,
I was testing a simple patch:
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index c6806eef906f..e0cda3a65f28 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -511,6 +511,9 @@ static void dsa_skb_tx_timestamp(struct dsa_slave_priv *p,
 	struct sk_buff *clone;
 	unsigned int type;
 
+	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)))
+		return;
+
 	type = ptp_classify_raw(skb);
 	if (type == PTP_CLASS_NONE)
 		return;
The reasoning behind it: we can avoid calling ptp_classify_raw() when
TX timestamping was not requested. The point is that ptp_classify_raw()
is at its most expensive precisely when it is not looking at a PTP
frame, because the whole test chain runs:
-> test_ipv4
-> test_ipv6
-> test_8021q
-> test_ieee1588
because only then would we know for sure that it's not a PTP frame.
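To make that cost concrete, here is a minimal userspace sketch of the
same test cascade. To be clear, this is not the kernel's implementation
(ptp_classify_raw runs a precompiled BPF filter, which is why it shows
up as ___bpf_prog_run in the profiles below), and the helper names and
parsing shortcuts are mine; only the EtherType and UDP port constants
and the order of the tests match the list above:

#include <stdint.h>
#include <stdio.h>

#define ETH_P_IP	0x0800
#define ETH_P_IPV6	0x86DD
#define ETH_P_8021Q	0x8100
#define ETH_P_1588	0x88F7	/* PTP over plain Ethernet */
#define PTP_EV_PORT	319	/* UDP port for PTP event messages */

enum ptp_class { PTP_NONE, PTP_L2, PTP_IPV4, PTP_IPV6 };

/* UDP destination port of an IPv4 packet, or -1 if not UDP/IPv4 */
static int ipv4_udp_dport(const uint8_t *p, size_t len)
{
	size_t ihl;

	if (len < 20 || (p[0] >> 4) != 4 || p[9] != 17 /* UDP */)
		return -1;
	ihl = (p[0] & 0x0f) * 4;
	if (len < ihl + 8)
		return -1;
	return (p[ihl + 2] << 8) | p[ihl + 3];
}

/* Same for IPv6, ignoring extension headers for brevity */
static int ipv6_udp_dport(const uint8_t *p, size_t len)
{
	if (len < 48 || (p[0] >> 4) != 6 || p[6] != 17 /* UDP */)
		return -1;
	return (p[42] << 8) | p[43];
}

static enum ptp_class classify_sketch(const uint8_t *f, size_t len)
{
	uint16_t etype;

	if (len < 14)
		return PTP_NONE;
	etype = (f[12] << 8) | f[13];

	/* test_ipv4: even a TCP frame pays for the header parse */
	if (etype == ETH_P_IP)
		return ipv4_udp_dport(f + 14, len - 14) == PTP_EV_PORT ?
		       PTP_IPV4 : PTP_NONE;
	/* test_ipv6 */
	if (etype == ETH_P_IPV6)
		return ipv6_udp_dport(f + 14, len - 14) == PTP_EV_PORT ?
		       PTP_IPV6 : PTP_NONE;
	/* test_8021q: peel one VLAN tag, look for L2 PTP behind it */
	if (etype == ETH_P_8021Q && len >= 18)
		etype = (f[16] << 8) | f[17];
	/* test_ieee1588 */
	if (etype == ETH_P_1588)
		return PTP_L2;

	/* only down here do we know for sure it's not a PTP frame */
	return PTP_NONE;
}

int main(void)
{
	/* Ethernet header with EtherType 0x88F7: PTP over L2 */
	uint8_t ptp_l2[14] = { [12] = 0x88, [13] = 0xf7 };

	printf("class = %d\n", classify_sketch(ptp_l2, sizeof(ptp_l2)));
	return 0;
}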
But since we should not be doing TX timestamping at all unless
skb_shinfo(skb)->tx_flags has SKBTX_HW_TSTAMP set, let's just check
that flag directly first. This saves 3 checks for each normal frame.
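For context: SKBTX_HW_TSTAMP only ends up set in tx_flags when user
space actually asked for hardware TX timestamps, e.g. via
SO_TIMESTAMPING. A minimal sketch of such a request (error handling
omitted; the NIC itself must additionally be configured through the
SIOCSHWTSTAMP ioctl):

#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Packets sent on 'fd' after this will have SKBTX_HW_TSTAMP set
 * in skb_shinfo(skb)->tx_flags.
 */
static int request_hw_tx_timestamps(int fd)
{
	int flags = SOF_TIMESTAMPING_TX_HARDWARE |
		    SOF_TIMESTAMPING_RAW_HARDWARE;

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}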
In theory all is fine, and here is some perf data taken on a board.
Only the consumers of more than 1% of CPU cycles are shown.
ptp_classify_raw is visible via ___bpf_prog_run. There is one
ptp_classify_raw call on the TX path and another on the RX path; this
patch only gets rid of the one on TX. So ___bpf_prog_run goes from
being the #3 consumer, at 4.93% of CPU cycles, to basically
disappearing when there is mostly only TX traffic on the interface.
$ perf record -e cycles iperf3 -c 192.168.1.2 -t 100
Before:
Overhead Command Shared Object Symbol
........ ....... ................. ......................................
20.23% iperf3 [kernel.kallsyms] [k] csum_partial_copy_nocheck
15.36% iperf3 [kernel.kallsyms] [k] arm_copy_from_user
4.93% iperf3 [kernel.kallsyms] [k] ___bpf_prog_run
3.09% iperf3 [kernel.kallsyms] [k] __kmalloc_track_caller
3.07% iperf3 [kernel.kallsyms] [k] kmem_cache_alloc
3.01% iperf3 [kernel.kallsyms] [k] skb_segment
2.33% iperf3 [kernel.kallsyms] [k] __dev_queue_xmit
2.22% iperf3 [kernel.kallsyms] [k] skb_copy_and_csum_bits
1.83% iperf3 [kernel.kallsyms] [k] pfifo_fast_dequeue
1.78% iperf3 [kernel.kallsyms] [k] sja1105_xmit
1.71% iperf3 [kernel.kallsyms] [k] dev_hard_start_xmit
1.71% iperf3 [kernel.kallsyms] [k] gfar_start_xmit
1.66% iperf3 [kernel.kallsyms] [k] tcp_gso_segment
1.50% iperf3 [kernel.kallsyms] [k] mmiocpy
1.46% iperf3 [kernel.kallsyms] [k] __qdisc_run
1.35% iperf3 [kernel.kallsyms] [k] sch_direct_xmit
1.34% iperf3 [kernel.kallsyms] [k] pfifo_fast_enqueue
1.28% iperf3 [kernel.kallsyms] [k] mmioset
1.27% iperf3 [kernel.kallsyms] [k] tcp_ack
1.10% iperf3 [kernel.kallsyms] [k] __alloc_skb
1.07% iperf3 [kernel.kallsyms] [k] dsa_slave_xmit
After:
Overhead Command Shared Object Symbol
........ ....... ................. ......................................
20.37% iperf3 [kernel.kallsyms] [k] csum_partial_copy_nocheck
17.84% iperf3 [kernel.kallsyms] [k] arm_copy_from_user
3.06% iperf3 [kernel.kallsyms] [k] kmem_cache_alloc
3.01% iperf3 [kernel.kallsyms] [k] __kmalloc_track_caller
2.99% iperf3 [kernel.kallsyms] [k] skb_segment
2.29% iperf3 [kernel.kallsyms] [k] __dev_queue_xmit
2.28% iperf3 [kernel.kallsyms] [k] skb_copy_and_csum_bits
1.92% iperf3 [kernel.kallsyms] [k] sja1105_xmit
1.69% iperf3 [kernel.kallsyms] [k] tcp_gso_segment
1.64% iperf3 [kernel.kallsyms] [k] pfifo_fast_dequeue
1.61% iperf3 [kernel.kallsyms] [k] gfar_start_xmit
1.51% iperf3 [kernel.kallsyms] [k] dev_hard_start_xmit
1.50% iperf3 [kernel.kallsyms] [k] __qdisc_run
1.48% iperf3 [kernel.kallsyms] [k] mmiocpy
1.42% iperf3 [kernel.kallsyms] [k] tcp_ack
1.34% iperf3 [kernel.kallsyms] [k] pfifo_fast_enqueue
1.31% iperf3 [kernel.kallsyms] [k] mmioset
1.23% iperf3 [kernel.kallsyms] [k] sch_direct_xmit
1.17% iperf3 [kernel.kallsyms] [k] dsa_slave_xmit
1.13% iperf3 [kernel.kallsyms] [k] __alloc_skb
The only problem? Throughput is actually a few Mbps worse, and this is
100% reproducible; it doesn't appear to be measurement error.
Before:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-100.00 sec 10.9 GBytes 935 Mbits/sec 0 sender
[ 5] 0.00-100.01 sec 10.9 GBytes 935 Mbits/sec receiver
After:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-100.00 sec 10.8 GBytes 926 Mbits/sec 0 sender
[ 5] 0.00-100.00 sec 10.8 GBytes 926 Mbits/sec receiver
I have tried to study the cache misses and branch misses in the good
and the bad case, but I am not seeing any obvious culprit there:
Before:
# Samples: 339K of event 'cache-misses'
# Event count (approx.): 647900762
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. ......................................
#
30.40% 102932 iperf3 [kernel.kallsyms] [k] arm_copy_from_user
27.70% 93906 iperf3 [kernel.kallsyms] [k] csum_partial_copy_nocheck
3.22% 10923 iperf3 [kernel.kallsyms] [k] kmem_cache_alloc
3.21% 10886 iperf3 [kernel.kallsyms] [k] __kmalloc_track_caller
2.82% 9711 iperf3 [kernel.kallsyms] [k] tcp_gso_segment
2.28% 7718 iperf3 [kernel.kallsyms] [k] skb_segment
2.20% 7507 iperf3 [kernel.kallsyms] [k] ___bpf_prog_run
1.81% 6124 iperf3 [kernel.kallsyms] [k] mmioset
1.66% 5706 iperf3 [kernel.kallsyms] [k] inet_gso_segment
1.17% 4028 iperf3 [kernel.kallsyms] [k] ip_send_check
1.13% 3863 iperf3 [kernel.kallsyms] [k] sja1105_xmit
1.12% 3801 iperf3 [kernel.kallsyms] [k] __ksize
1.12% 3783 iperf3 [kernel.kallsyms] [k] skb_copy_and_csum_bits
1.08% 3723 iperf3 [kernel.kallsyms] [k] csum_partial
0.98% 3351 iperf3 [kernel.kallsyms] [k] netdev_core_pick_tx
# Samples: 326K of event 'branch-misses'
# Event count (approx.): 527179670
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. ....................................
#
10.67% 34836 iperf3 [kernel.kallsyms] [k] pfifo_fast_dequeue
7.09% 23325 iperf3 [kernel.kallsyms] [k] __kmalloc_track_caller
5.91% 19296 iperf3 [kernel.kallsyms] [k] gfar_start_xmit
4.94% 16183 iperf3 [kernel.kallsyms] [k] __dev_queue_xmit
3.75% 12241 iperf3 [kernel.kallsyms] [k] _raw_spin_lock
3.67% 11990 iperf3 [kernel.kallsyms] [k] __qdisc_run
3.58% 11791 iperf3 [kernel.kallsyms] [k] skb_copy_and_csum_bits
3.53% 11563 iperf3 [kernel.kallsyms] [k] sch_direct_xmit
3.51% 11477 iperf3 [kernel.kallsyms] [k] dsa_slave_xmit
3.35% 11014 iperf3 [kernel.kallsyms] [k] mmioset
3.30% 10854 iperf3 [kernel.kallsyms] [k] kmem_cache_alloc
2.25% 7361 iperf3 [kernel.kallsyms] [k] ___bpf_prog_run
2.16% 7075 iperf3 [kernel.kallsyms] [k] dma_map_page_attrs
2.09% 6837 iperf3 [kernel.kallsyms] [k] dev_hard_start_xmit
1.77% 5700 iperf3 [kernel.kallsyms] [k] tcp_ack
1.58% 5143 iperf3 [kernel.kallsyms] [k] validate_xmit_skb.constprop.54
1.56% 5143 iperf3 [kernel.kallsyms] [k] __copy_skb_header
1.45% 4757 iperf3 [kernel.kallsyms] [k] sja1105_xmit
1.36% 4483 iperf3 [kernel.kallsyms] [k] mmiocpy
1.30% 4277 iperf3 [kernel.kallsyms] [k] csum_partial_copy_nocheck
1.27% 4160 iperf3 [kernel.kallsyms] [k] netif_skb_features
1.20% 3897 iperf3 [kernel.kallsyms] [k] get_page_from_freelist
1.15% 3737 iperf3 [kernel.kallsyms] [k] tcp_sendmsg_locked
1.13% 3693 iperf3 [kernel.kallsyms] [k] is_vmalloc_addr
1.02% 3306 iperf3 [kernel.kallsyms] [k] arm_copy_from_user
0.98% 3202 iperf3 [kernel.kallsyms] [k] pfifo_fast_enqueue
After:
# Samples: 290K of event 'cache-misses'
# Event count (approx.): 586163914
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .....................................
#
32.97% 94339 iperf3 [kernel.kallsyms] [k] arm_copy_from_user
26.46% 77278 iperf3 [kernel.kallsyms] [k] csum_partial_copy_nocheck
3.17% 9246 iperf3 [kernel.kallsyms] [k] kmem_cache_alloc
3.03% 8838 iperf3 [kernel.kallsyms] [k] __kmalloc_track_caller
2.46% 7258 iperf3 [kernel.kallsyms] [k] tcp_gso_segment
2.21% 6435 iperf3 [kernel.kallsyms] [k] skb_segment
1.87% 5518 iperf3 [kernel.kallsyms] [k] sja1105_xmit
1.72% 5021 iperf3 [kernel.kallsyms] [k] mmioset
1.51% 4449 iperf3 [kernel.kallsyms] [k] inet_gso_segment
1.06% 3104 iperf3 [kernel.kallsyms] [k] __ksize
1.05% 3057 iperf3 [kernel.kallsyms] [k] skb_copy_and_csum_bits
1.04% 3073 iperf3 [kernel.kallsyms] [k] ip_send_check
0.96% 2761 iperf3 [kernel.kallsyms] [k] tcp_sendmsg_locked
0.96% 2836 iperf3 [kernel.kallsyms] [k] csum_partial
0.96% 2824 iperf3 [kernel.kallsyms] [k] dsa_8021q_xmit
# Samples: 266K of event 'branch-misses'
# Event count (approx.): 370809162
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .....................................
#
8.88% 25491 iperf3 [kernel.kallsyms] [k] __kmalloc_track_caller
5.16% 12016 iperf3 [kernel.kallsyms] [k] pfifo_fast_dequeue
4.86% 13401 iperf3 [kernel.kallsyms] [k] __dev_queue_xmit
4.32% 12298 iperf3 [kernel.kallsyms] [k] skb_copy_and_csum_bits
4.28% 12167 iperf3 [kernel.kallsyms] [k] mmioset
4.05% 11561 iperf3 [kernel.kallsyms] [k] kmem_cache_alloc
3.90% 10864 iperf3 [kernel.kallsyms] [k] __qdisc_run
3.19% 7854 iperf3 [kernel.kallsyms] [k] gfar_start_xmit
2.91% 7817 iperf3 [kernel.kallsyms] [k] dsa_slave_xmit
2.42% 6418 iperf3 [kernel.kallsyms] [k] dev_hard_start_xmit
2.08% 6087 iperf3 [kernel.kallsyms] [k] mmiocpy
2.05% 5786 iperf3 [kernel.kallsyms] [k] tcp_ack
1.87% 4610 iperf3 [kernel.kallsyms] [k] sch_direct_xmit
1.76% 5112 iperf3 [kernel.kallsyms] [k] __copy_skb_header
1.67% 4779 iperf3 [kernel.kallsyms] [k] csum_partial_copy_nocheck
1.60% 3862 iperf3 [kernel.kallsyms] [k] sja1105_xmit
1.52% 4593 iperf3 [kernel.kallsyms] [k] get_page_from_freelist
1.39% 4003 iperf3 [kernel.kallsyms] [k] arm_copy_from_user
1.31% 3881 iperf3 [kernel.kallsyms] [k] tcp_sendmsg_locked
1.26% 3183 iperf3 [kernel.kallsyms] [k] _raw_spin_lock
1.21% 2899 iperf3 [kernel.kallsyms] [k] dma_map_page_attrs
1.10% 3111 iperf3 [kernel.kallsyms] [k] __free_pages_ok
1.06% 3037 iperf3 [kernel.kallsyms] [k] tcp_write_xmit
0.95% 2667 iperf3 [kernel.kallsyms] [k] skb_segment
My untrained eye tells me that in the 'after patch' case (the worse
one), there are fewer branch misses and fewer cache misses. So by all
perf metrics, the throughput should be better, but it isn't. What gives?
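The only other thing I can think of is comparing the aggregate counts
and the instructions-per-cycle directly, rather than the per-symbol
profiles, along these lines (the event list is just my guess at what
is relevant):

$ perf stat -e cycles,instructions,branch-misses,cache-misses iperf3 -c 192.168.1.2 -t 100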