[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <d11a2ce6ed394acd8c6da29d0358f7ce@AcuMS.aculab.com>
Date: Mon, 16 May 2022 12:58:35 +0000
From: David Laight <David.Laight@...LAB.COM>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: UDP receive performance drop since 3.10
I've noticed a doubling in the cpu cost of udp processing
between a RHEL 3.10 kernel and a 5.18-rc6 one.
This is (probably) all within ip_rcv().
I'm testing very high rate UDP receive of RTP audio.
(The target is 500000 udp/sec.)
I've enable RPS so that ip_rcv() runs on different multiple
cpus from the ethernet code.
(RSS on the BCM5720 (tg3) doesn't seem to work very well.)
On the 3.10 kernel the 'RPS' cpu show about 5% 'soft int' time.
With 5.10 this has doubled to 10% for much the same test.
The ftrace for a single packet shows a lot of extra code.
With a RHEL 3.10 kernel the trace is quite short:
/* netif_receive_skb: dev=em2 skbaddr=ffff99c3ee5ae000 len=200 */
ip_rcv() {
ip_rcv_finish() {
0.483 us udp_v4_early_demux();
ip_route_input_noref() {
ip_route_input_slow() {
2.036 us fib_table_lookup();
fib_validate_source() {
__fib_validate_source.isra.13() {
0.646 us fib_table_lookup();
1.589 us }
2.610 us }
6.820 us }
7.755 us }
ip_local_deliver() {
ip_local_deliver_finish() {
0.250 us raw_local_deliver();
udp_rcv() {
__udp4_lib_rcv() {
__udp4_lib_lookup() {
0.063 us compute_score();
0.097 us compute_score();
1.496 us }
udp_queue_rcv_skb() {
sk_filter_trim_cap() {
security_sock_rcv_skb() {
0.066 us cap_socket_sock_rcv_skb();
1.024 us }
1.836 us }
0.093 us ipv4_pktinfo_prepare();
__udp_enqueue_schedule_skb() {
0.066 us _raw_spin_lock();
sock_def_readable() {
__wake_up_sync_key() {
__wake_up_common_lock() {
0.194 us _raw_spin_lock_irqsave();
__wake_up_common() {
ep_poll_callback() {
0.184 us _raw_spin_lock_irqsave();
0.084 us _raw_spin_unlock_irqrestore();
2.009 us }
3.264 us }
0.087 us _raw_spin_unlock_irqrestore();
5.579 us }
6.311 us }
7.241 us }
8.833 us }
+ 12.948 us }
+ 16.365 us }
+ 17.280 us }
+ 19.900 us }
+ 20.673 us }
+ 31.519 us }
+ 32.534 us }
Whereas 5.18 has a much longer trace:
ip_rcv() {
0.668 us ip_rcv_core();
ip_rcv_finish_core.constprop.0() {
1.155 us udp_v4_early_demux();
ip_route_input_noref() {
0.306 us __rcu_read_lock();
ip_route_input_rcu() {
ip_route_input_slow() {
make_kuid() {
0.441 us map_id_range_down();
1.231 us }
0.307 us __rcu_read_lock();
fib_table_lookup() {
1.268 us fib_lookup_good_nhc();
2.736 us }
0.307 us __rcu_read_unlock();
fib_validate_source() {
__fib_validate_source() {
make_kuid() {
0.304 us map_id_range_down();
0.931 us }
0.304 us __rcu_read_lock();
fib_table_lookup() {
0.493 us fib_lookup_good_nhc();
1.405 us }
0.393 us __rcu_read_unlock();
0.390 us fib_info_nh_uses_dev();
5.457 us }
6.327 us }
+ 13.726 us }
+ 14.727 us }
0.407 us __rcu_read_unlock();
+ 16.673 us }
+ 19.389 us }
ip_local_deliver() {
ip_local_deliver_finish() {
0.376 us __rcu_read_lock();
ip_protocol_deliver_rcu() {
0.434 us raw_local_deliver();
udp_rcv() {
__udp4_lib_rcv() {
__udp4_lib_lookup() {
0.326 us udp4_lib_lookup2.isra.0();
0.928 us udp4_lib_lookup2.isra.0();
2.413 us }
udp_unicast_rcv_skb() {
udp_queue_rcv_skb() {
udp_queue_rcv_one_skb() {
sk_filter_trim_cap() {
0.440 us security_sock_rcv_skb();
sk_filter_trim_cap.part.0() {
0.297 us __rcu_read_lock();
0.310 us __rcu_read_unlock();
1.531 us }
3.277 us }
0.334 us skb_pull_rcsum();
0.310 us ipv4_pktinfo_prepare();
__udp_enqueue_schedule_skb() {
_raw_spin_lock() {
0.334 us preempt_count_add();
0.938 us }
_raw_spin_unlock() {
0.303 us preempt_count_sub();
0.908 us }
sock_def_readable() {
0.307 us __rcu_read_lock();
__wake_up_sync_key() {
__wake_up_common_lock() {
_raw_spin_lock_irqsave() {
0.326 us preempt_count_add();
0.951 us }
__wake_up_common() {
ep_poll_callback() {
_raw_read_lock_irqsave() {
0.330 us preempt_count_add();
1.152 us }
0.614 us __rcu_read_lock();
0.323 us __rcu_read_unlock();
_raw_read_unlock_irqrestore() {
0.380 us preempt_count_sub();
0.995 us }
5.410 us }
6.741 us }
_raw_spin_unlock_irqrestore() {
0.317 us preempt_count_sub();
1.094 us }
9.994 us }
+ 10.806 us }
0.327 us __rcu_read_unlock();
+ 12.809 us }
+ 16.182 us }
+ 22.153 us }
+ 22.769 us }
+ 23.528 us }
+ 27.878 us }
+ 28.646 us }
+ 30.476 us }
0.324 us __rcu_read_unlock();
+ 32.398 us }
+ 33.168 us }
+ 54.976 us }
Now I know the cost of ftrace is significant (and seems to be
higher in 5.18) but there also seems to be a lot more code.
As well as the extra rcu locks (which are probably mostly ftrace
overhead, a few other things stick out:
1) The sock_net_uid(net, NULL) calls.
These are make_kuid(net->user_ns, 0) - so pretty much constant.
They seem to end up in a loop in map_id_range_down_base().
All looks expensive in the default network namespace where
0 maps to 0.
2) Extra code in fib_lookup().
3) A lot more locking in ep_poll_callback().
The 5.18 kernel also seems to have CONFIG_DEBUG_PREEMPT set.
I can't find the Kconfig entry for it.
It doesn't exist in the old .config at all.
So I'm not sure why 'make oldconfig' picked it up.
The other possibility is that the extra code is tick_nohz_idle_exit().
The 3.10 trace is from a non-RPS config so I can't compare it.
I'm going to disable CONFIG_DEBUG_PREEMPT to see how much
difference it makes.
Any idea if any other debug options will have got picked up?
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Powered by blists - more mailing lists