Message-ID: <20081117110119.GL28786@elte.hu>
Date: Mon, 17 Nov 2008 12:01:19 +0100
From: Ingo Molnar <mingo@...e.hu>
To: David Miller <davem@...emloft.net>
Cc: rjw@...k.pl, linux-kernel@...r.kernel.org,
kernel-testers@...r.kernel.org, cl@...ux-foundation.org,
efault@....de, a.p.zijlstra@...llo.nl,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [Bug #11308] tbench regression on each kernel release from
2.6.22 -> 2.6.28
* David Miller <davem@...emloft.net> wrote:
> From: Ingo Molnar <mingo@...e.hu>
> Date: Mon, 17 Nov 2008 10:06:48 +0100
>
> >
> > * Rafael J. Wysocki <rjw@...k.pl> wrote:
> >
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.26 and 2.6.27.
> > >
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.26 and 2.6.27. Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > > Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
> > > Submitter : Christoph Lameter <cl@...ux-foundation.org>
> > > Date : 2008-08-11 18:36 (98 days old)
> > > References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > > http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> >
> > Christoph, as per the recent analysis of Mike:
> >
> > http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> >
> > all scheduler components of this regression have been eliminated.
> >
> > In fact his numbers show that scheduler speedups since 2.6.22 have
> > offset and hidden most other sources of tbench regression. (i.e. the
> > scheduler portion got 5% faster, hence it was able to offset a
> > slowdown of 5% in other areas of the kernel that tbench triggers)
>
> Although I respect the improvements, wake_up() is still several
> orders of magnitude slower than it was in 2.6.22 and wake_up() is at
> the top of the profiles in tbench runs.
hm, several orders of magnitude slower? That contradicts Mike's
numbers and my own numbers and profiles as well: see below.
The scheduler's overhead barely even registers on a 16-way x86 system
i'm running tbench on. Here's the NMI profile during a 64-thread tbench
run on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:
Throughput 3437.65 MB/sec 64 procs
==================================
21570252 total
........
1494803 copy_user_generic_string
998232 sock_rfree
491471 tcp_ack
482405 ip_dont_fragment
470685 ip_local_deliver
436325 constant_test_bit [ called by napi_disable_pending() ]
375469 avc_has_perm_noaudit
347663 tcp_sendmsg
310383 tcp_recvmsg
300412 __inet_lookup_established
294377 system_call
286603 tcp_transmit_skb
251782 selinux_ip_postroute
236028 tcp_current_mss
235631 schedule
234013 netif_rx
229854 _local_bh_enable_ip
219501 tcp_v4_rcv
[ etc. - see full profile attached further below ]
Note that the scheduler does not even show up in the profile until
entry #15!
I've also summarized NMI profiler output by major subsystems:
NET overhead (12603450/21570252): 58.43%
security overhead ( 1903598/21570252): 8.83%
usercopy overhead ( 1753617/21570252): 8.13%
sched overhead ( 1599406/21570252): 7.41%
syscall overhead ( 560487/21570252): 2.60%
IRQ overhead ( 555439/21570252): 2.58%
slab overhead ( 492421/21570252): 2.28%
timer overhead ( 226573/21570252): 1.05%
pagealloc overhead ( 192681/21570252): 0.89%
PID overhead ( 115123/21570252): 0.53%
VFS overhead ( 107926/21570252): 0.50%
pagecache overhead ( 62552/21570252): 0.29%
gtod overhead ( 38651/21570252): 0.18%
IDLE overhead ( 0/21570252): 0.00%
---------------------------------------------------------
left ( 1349494/21570252): 6.26%
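( A per-subsystem summary like the one above can be produced by
  bucketing the flat profile by symbol prefix. Here's a rough sketch -
  the prefix lists are illustrative guesses, not the exact
  classification used for the numbers above, and it assumes the profile
  is available as "<count> <symbol>" lines, e.g. readprofile output or
  the dump attached below: )

#!/usr/bin/env python
# Rough sketch: bucket a flat "<count> <symbol>" profile into subsystem
# categories and print per-subsystem percentages. The prefix lists are
# illustrative only - adjust them to taste.
import re
import sys

BUCKETS = [
    ("NET",       ("tcp_", "ip_", "__ip_", "sock_", "__sock_", "skb_",
                   "__skb_", "netif_", "net_", "inet_", "__inet_", "eth_",
                   "dev_", "nf_", "loopback_", "sk_")),
    ("security",  ("selinux_", "avc_", "security_", "cap_", "netlbl_", "sel_")),
    ("usercopy",  ("copy_user_", "copy_to_user", "copy_from_user",
                   "memcpy_toiovec")),
    ("sched",     ("schedule", "sched_", "__switch_to", "switch_mm",
                   "try_to_wake_up", "__wake_up", "wake_up", "pick_next_",
                   "enqueue_", "dequeue_", "__enqueue_", "__dequeue_",
                   "update_curr", "check_preempt_", "select_task_rq",
                   "put_prev_task", "set_next_entity", "finish_task_switch")),
    ("slab",      ("kmem_cache_", "kfree", "__kmalloc", "____cache_alloc",
                   "__cache_free", "virt_to_cache", "virt_to_slab")),
    ("timer",     ("mod_timer", "__mod_timer", "internal_add_timer",
                   "lock_timer_base", "detach_timer")),
    ("pagealloc", ("get_page_from_freelist", "free_hot_cold_page",
                   "__alloc_pages", "free_pages_check", "put_page",
                   "get_page")),
]

counts = {}
total = 0
for line in sys.stdin:
    m = re.match(r"\s*(\d+)\s+(\S+)", line)
    if not m or m.group(2) == "total":
        continue                      # skip headers, separators, the total line
    hits, sym = int(m.group(1)), m.group(2)
    total += hits
    for name, prefixes in BUCKETS:
        if any(sym.startswith(p) for p in prefixes):
            counts[name] = counts.get(name, 0) + hits
            break
    else:
        counts["left"] = counts.get("left", 0) + hits

for name, hits in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%-9s overhead (%8d/%8d): %6.2f%%" % (name, hits, total,
                                                100.0 * hits / total))

( usage: pipe a profile through it, e.g.
  readprofile -m /boot/System.map | python bucket-profile.py
  - the script name is arbitrary. )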
The scheduler's functions are absolutely flat, and consistent with an
extreme context-switching rate of 1.35 million per second. The
scheduler can go up to about 20 million context switches per second on
this system:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
32 0 0 32229696 29308 649880 0 0 0 0 164135 20026853 24 76 0 0 0
32 0 0 32229752 29308 649880 0 0 0 0 164203 20032770 24 76 0 0 0
32 0 0 32229752 29308 649880 0 0 0 0 164201 20036492 25 75 0 0 0
... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
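( If you want the raw rate without vmstat: /proc/stat exports a
  monotonically increasing "ctxt" counter - the same counter vmstat's
  "cs" column is derived from - and its one-second delta gives the
  context-switch rate directly. Minimal sketch: )

#!/usr/bin/env python
# Print the system-wide context-switch rate once per second, straight
# from the "ctxt" counter in /proc/stat (the source of vmstat's "cs").
import time

def ctxt():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no ctxt line in /proc/stat")

prev = ctxt()
while True:
    time.sleep(1)
    cur = ctxt()
    print("%d context switches/sec" % (cur - prev))
    prev = cur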
Wake-up affinities and data-flow caching are just fine in this workload
- we've got scheduler statistics for that and they look good too.
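( The per-task counters live in /proc/<pid>/sched when CONFIG_SCHEDSTATS
  is enabled. A quick sketch to dump the wakeup-placement counters of the
  tbench tasks - note that the exact field names (se.nr_wakeups_affine
  etc.) vary between kernel versions, and the task names
  tbench/tbench_srv are an assumption: )

#!/usr/bin/env python
# Sketch: dump per-task wakeup-placement statistics for tbench tasks from
# /proc/<pid>/sched. Needs CONFIG_SCHEDSTATS; the field names below are
# an assumption - adjust to whatever your kernel version exposes.
import glob
import os

FIELDS = ("se.nr_wakeups", "se.nr_wakeups_affine",
          "se.nr_wakeups_local", "se.nr_wakeups_remote")

for sched_file in glob.glob("/proc/[0-9]*/sched"):
    pid_dir = os.path.dirname(sched_file)
    try:
        with open(os.path.join(pid_dir, "status")) as f:
            name = f.readline().split(":", 1)[1].strip()   # "Name:\t<comm>"
    except (IOError, IndexError):
        continue
    if name not in ("tbench", "tbench_srv"):               # assumed task names
        continue
    stats = {}
    try:
        with open(sched_file) as f:
            for line in f:
                parts = line.split(":")
                if len(parts) == 2 and parts[0].strip() in FIELDS:
                    stats[parts[0].strip()] = parts[1].strip()
    except IOError:
        continue                                           # task exited meanwhile
    print("%s %-10s %s" % (os.path.basename(pid_dir), name, stats))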
It all looks like pure old-fashioned straight overhead in the
networking layer to me. Do we still touch the same global cacheline
for every localhost packet we process? Anything like that would show
up big time.
Anyway, in terms of scheduling there's absolutely nothing anomalous i
can see about this workload. Scheduling looks healthy throughout - and
the few things we noticed causing unnecessary overhead are now fixed
in -rc5. (but it's all in the <5% range of impact of total scheduling
overhead - i.e. in the 0.4% absolute range in this workload)
And the thing is, the scheduler's task in this workload is by far the
most difficult one conceptually: it has to manage and optimize
concurrency of _future_ processing, with an event frequency that is
_WAY_ out of the normal patterns: more than 1.3 million context
switches per second (!). It also switches to/from completely
independent contexts of computing, with all the implications that
this brings.
Networking and VFS "just" have to shuffle bits around in memory along a
very specific plan given to it by user-space. That plan is
well-specified and goes along the lines of: "copy this (already
cached) file content to that socket" and back.
By the raw throughput figures the system is pushing a couple of
million data packets per second.
Still we spend 7 times more CPU time in the networking code than in
the scheduler or in the user-copy code. Why?
Ingo
------------------------->
21570252 total
........
1494803 copy_user_generic_string
998232 sock_rfree
491471 tcp_ack
482405 ip_dont_fragment
470685 ip_local_deliver
436325 constant_test_bit
375469 avc_has_perm_noaudit
347663 tcp_sendmsg
310383 tcp_recvmsg
300412 __inet_lookup_established
294377 system_call
286603 tcp_transmit_skb
251782 selinux_ip_postroute
236028 tcp_current_mss
235631 schedule
234013 netif_rx
229854 _local_bh_enable_ip
219501 tcp_v4_rcv
210046 netlbl_enabled
205022 constant_test_bit
199598 skb_release_head_state
187952 ip_queue_xmit
178779 tcp_established_options
175955 dev_queue_xmit
169904 netif_receive_skb
166629 ip_finish_output2
162291 sysret_check
151262 __switch_to
143355 audit_syscall_entry
142694 load_cr3
136571 memset_c
136115 nf_hook_slow
130825 ip_local_deliver_finish
128795 ip_rcv
125995 selinux_socket_sock_rcv_skb
123944 net_rx_action
123100 __copy_skb_header
122052 __inet_lookup
121744 constant_test_bit
119444 get_page_from_freelist
116486 avc_has_perm
115643 audit_syscall_exit
115123 find_pid_ns
114483 tcp_cleanup_rbuf
111350 tcp_rcv_established
109853 __mod_timer
107891 lock_sock_nested
107316 napi_disable_pending
106581 release_sock
104402 skb_copy_datagram_iovec
101591 __tcp_push_pending_frames
101206 tcp_event_data_recv
98046 kmem_cache_alloc_node
97982 tcp_v4_do_rcv
92714 sys_recvfrom
91551 rb_erase
89730 kfree
87979 ip_rcv_finish
87166 compare_ether_addr
86982 selinux_parse_skb
86731 nf_iterate
79690 selinux_ipv4_output
79347 __cache_free
78992 audit_free_names
78127 skb_release_data
77501 mod_timer
77241 __sock_recvmsg
77228 sock_recvmsg
77211 ____cache_alloc
76495 tcp_rcv_space_adjust
75283 sk_wait_data
71772 sys_sendto
71594 sched_clock
70880 eth_type_trans
70238 memcpy_toiovec
69193 do_softirq
68341 __update_sched_clock
67597 tcp_v4_md5_lookup
67424 try_to_wake_up
64465 sock_common_recvmsg
64116 put_prev_task_fair
63964 process_backlog
62216 __do_softirq
62093 tcp_cwnd_validate
61128 __alloc_skb
60588 put_page
59536 dput
58411 __ip_local_out
56349 avc_audit
55626 __napi_schedule
55525 selinux_ipv4_postroute
54499 __enqueue_entity
53599 local_bh_disable
53418 unroll_tree_refs
53162 __unlazy_fpu
53084 cfs_rq_of
52475 set_next_entity
51108 thread_return
50458 ip_output
50268 sched_clock_cpu
49974 tcp_send_delayed_ack
49736 ip_finish_output
49670 finish_task_switch
49070 ___swab16
48499 audit_get_context
48347 raw_local_deliver
47824 tcp_rtt_estimator
46707 tcp_push
46405 constant_test_bit
45859 select_task_rq_fair
45188 math_state_restore
44889 check_preempt_wakeup
44449 task_rq_lock
43704 sel_netif_sid
43377 sock_sendmsg
42612 sk_reset_timer
42606 __skb_clone
42223 __find_general_cachep
41950 selinux_socket_sendmsg
41716 constant_test_bit
41097 skb_push
40723 lock_sock
40715 system_call_after_swapgs
40399 selinux_netlbl_inode_permission
40179 rb_insert_color
40021 __kfree_skb
40015 sockfd_lookup_light
39216 internal_add_timer
39024 skb_can_coalesce
38838 __tcp_select_window
38651 current_kernel_time
38533 tcp_v4_md5_do_lookup
38372 __sock_sendmsg
38162 selinux_socket_recvmsg
37812 sel_netport_sid
37727 account_group_exec_runtime
37695 switch_mm
36247 nf_hook_thresh
36057 auditsys
35266 pick_next_task_fair
35064 __tcp_ack_snd_check
35052 sock_def_readable
34826 sysret_careful
34578 _local_bh_enable
34498 free_hot_cold_page
34338 kmap
34028 loopback_xmit
33320 sk_stream_alloc_skb
33269 test_ti_thread_flag
33219 skb_fill_page_desc
33049 tcp_is_cwnd_limited
33012 update_min_vruntime
32431 native_read_tsc
32398 dst_release
31661 get_pageblock_flags_group
31652 path_put
31516 tcp_push_pending_frames
31265 netif_needs_gso
31175 constant_test_bit
31077 __cycles_2_ns
30971 socket_has_perm
30893 __phys_addr
30867 lock_timer_base
30585 __wake_up
30456 ret_from_sys_call
30147 skb_release_all
29356 local_bh_enable
29334 __skb_insert
28681 tcp_cwnd_test
28652 __skb_dequeue
28612 prepare_to_wait
28268 kmem_cache_free
28193 set_bit
28149 dequeue_task_fair
27906 skb_header_pointer
27861 sys_kill
27803 selinux_task_kill
27627 audit_free_aux
27600 selinux_netlbl_sock_rcv_skb
26794 update_curr
26777 __alloc_pages_internal
26469 skb_entail
26458 pskb_may_pull
26216 inet_ehashfn
26075 call_softirq
26033 copy_from_user
25933 __local_bh_disable
25666 fget_light
25270 inet_csk_reset_xmit_timer
25071 signal_pending_state
24117 tcp_init_tso_segs
24109 TCP_ECN_check_ce
23702 nf_hook_thresh
23558 copy_to_user
23426 sysret_audit
23267 sk_wake_async
22627 tcp_options_write
22174 netif_tx_queue_stopped
21795 tcp_prequeue_process
21757 tcp_set_skb_tso_segs
21579 avc_hash
21565 ___swab16
21560 ip_local_out
21445 sk_wmem_schedule
21234 get_page
21200 __wake_up_common
21042 sel_netnode_find
20772 sock_put
20625 schedule_timeout
20613 __napi_complete
20563 fput_light
20532 tcp_bound_to_half_wnd
19912 cap_task_kill
19773 sysret_signal
19374 compound_head
19121 get_seconds
19048 PageLRU
18893 zone_watermark_ok
18635 tcp_snd_wnd_test
18634 enqueue_task_fair
18603 rb_next
18598 next_zones_zonelist
18534 resched_task
17820 hash_64
17801 autoremove_wake_function
17451 __skb_queue_before
17283 native_load_tls
17227 __skb_dequeue
17149 xfrm4_policy_check
16942 zone_statistics
16886 skb_reset_network_header
16824 ___swab16
16725 pskb_may_pull
16645 dev_hard_start_xmit
16580 sk_filter
16523 tcp_ca_event
16479 tcp_win_from_space
16408 tcp_parse_aligned_timestamp
16204 finish_wait
16124 virt_to_slab
15965 tcp_v4_send_check
15920 skb_reset_transport_header
15867 tcp_data_snd_check
15819 security_sock_rcv_skb
15665 tcp_ack_saw_tstamp
15621 skb_network_offset
15568 virt_to_head_page
15553 dst_confirm
15320 skb_pull
15277 clear_bit
15179 alloc_pages_current
14991 bictcp_acked
14743 tcp_store_ts_recent
14660 sel_netnode_sid
14650 __xchg
14573 task_has_perm
14561 tcp_v4_check
14492 net_invalid_timestamp
14485 security_socket_recvmsg
14363 __dequeue_entity
14318 pid_nr_ns
14311 device_not_available
14212 local_bh_enable_ip
14092 virt_to_cache
13804 netpoll_rx
13781 fcheck_files
13724 tcp_adjust_fackets_out
13717 net_timestamp
13638 ___swab16
13576 sel_netport_find
13563 __kmalloc_node
13530 __inc_zone_state
13215 pid_vnr
13208 free_pages_check
13008 security_socket_sendmsg
12971 ip_skb_dst_mtu
12827 __cpu_set
12782 bictcp_cong_avoid
12779 test_tsk_thread_flag
12734 wakeup_preempt_entity
12651 sel_netif_find
12545 skb_set_owner_r
12534 skb_headroom
12348 tcp_event_new_data_sent
12251 place_entity
12047 set_bit
11805 update_rq_clock
11788 detach_timer
11659 policy_zonelist
11423 skb_clone
11380 __skb_queue_tail
11249 dequeue_task
10823 init_rootdomain
10690 __cpu_clear
10558 default_wake_function
10556 tcp_rcv_rtt_measure_ts
10451 PageSlab
10427 sock_wfree
10277 calc_delta_fair
10237 tcp_validate_incoming
10218 task_rq_unlock
10023 page_get_cache
[ attachment: "config" (text/plain, 72925 bytes) ]