Message-ID: <20081117110119.GL28786@elte.hu>
Date:	Mon, 17 Nov 2008 12:01:19 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	David Miller <davem@...emloft.net>
Cc:	rjw@...k.pl, linux-kernel@...r.kernel.org,
	kernel-testers@...r.kernel.org, cl@...ux-foundation.org,
	efault@....de, a.p.zijlstra@...llo.nl,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [Bug #11308] tbench regression on each kernel release from
	2.6.22 -> 2.6.28


* David Miller <davem@...emloft.net> wrote:

> From: Ingo Molnar <mingo@...e.hu>
> Date: Mon, 17 Nov 2008 10:06:48 +0100
> 
> > 
> > * Rafael J. Wysocki <rjw@...k.pl> wrote:
> > 
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.26 and 2.6.27.
> > > 
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.26 and 2.6.27.  Please verify if it still should
> > > be listed and let me know (either way).
> > > 
> > > 
> > > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > > Subject		: tbench regression on each kernel release from  2.6.22 -> 2.6.28
> > > Submitter	: Christoph Lameter <cl@...ux-foundation.org>
> > > Date		: 2008-08-11 18:36 (98 days old)
> > > References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > > 		  http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> > 
> > Christoph, as per the recent analysis of Mike:
> > 
> >  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> > 
> > all scheduler components of this regression have been eliminated.
> > 
> > In fact his numbers show that scheduler speedups since 2.6.22 have 
> > offset and hidden most other sources of tbench regression. (i.e. the 
> > scheduler portion got 5% faster, hence it was able to offset a 
> > slowdown of 5% in other areas of the kernel that tbench triggers)
> 
> Although I respect the improvements, wake_up() is still several 
> orders of magnitude slower than it was in 2.6.22 and wake_up() is at 
> the top of the profiles in tbench runs.

hm, several orders of magnitude slower? That contradicts Mike's 
numbers and my own numbers and profiles as well: see below.

The scheduler's overhead barely even registers on a 16-way x86 system 
i'm running tbench on. Here's the NMI profile during 64 threads tbench 
on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:

  Throughput 3437.65 MB/sec 64 procs
  ==================================
  21570252  total 
  ........
   1494803  copy_user_generic_string 
    998232  sock_rfree 
    491471  tcp_ack 
    482405  ip_dont_fragment 
    470685  ip_local_deliver 
    436325  constant_test_bit         [ called by napi_disable_pending() ]
    375469  avc_has_perm_noaudit 
    347663  tcp_sendmsg 
    310383  tcp_recvmsg 
    300412  __inet_lookup_established 
    294377  system_call 
    286603  tcp_transmit_skb 
    251782  selinux_ip_postroute 
    236028  tcp_current_mss 
    235631  schedule 
    234013  netif_rx 
    229854  _local_bh_enable_ip 
    219501  tcp_v4_rcv 

    [ etc. - see full profile attached further below ]

Note that the scheduler does not even show up in the profile until 
entry #15!

I've also summarized NMI profiler output by major subsystems:

           NET       overhead (12603450/21570252): 58.43%
           security  overhead ( 1903598/21570252):  8.83%
           usercopy  overhead ( 1753617/21570252):  8.13%
           sched     overhead ( 1599406/21570252):  7.41%
           syscall   overhead (  560487/21570252):  2.60%
           IRQ       overhead (  555439/21570252):  2.58%
           slab      overhead (  492421/21570252):  2.28%
           timer     overhead (  226573/21570252):  1.05%
           pagealloc overhead (  192681/21570252):  0.89%
           PID       overhead (  115123/21570252):  0.53%
           VFS       overhead (  107926/21570252):  0.50%
           pagecache overhead (   62552/21570252):  0.29%
           gtod      overhead (   38651/21570252):  0.18%
           IDLE      overhead (       0/21570252):  0.00%
---------------------------------------------------------
                         left ( 1349494/21570252):  6.26%
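
A rough sketch of how such a per-subsystem summary can be produced from 
a flat "count symbol" profile (in Python; the bucket prefixes are 
illustrative guesses, not necessarily the exact grouping used for the 
table above):

  #!/usr/bin/env python
  # sketch: bucket a flat "<count> <symbol>" profile into subsystems by
  # symbol prefix.  The prefix lists are illustrative guesses, not the
  # exact grouping used for the summary above.
  import sys
  from collections import defaultdict

  BUCKETS = {
      'NET':      ('tcp_', 'ip_', '__ip_', 'inet_', '__inet_', 'sock_',
                   '__sock_', 'sk_', 'skb_', '__skb_', 'net', 'dev_',
                   'eth_', 'nf_', 'loopback_', 'dst_'),
      'security': ('selinux_', 'avc_', 'security_', 'sel_', 'cap_'),
      'usercopy': ('copy_user_', 'copy_to_user', 'copy_from_user',
                   'memcpy_toiovec'),
      'sched':    ('schedule', 'sched_', '__switch_to', 'switch_mm',
                   'pick_next_', 'put_prev_', 'enqueue_', 'dequeue_',
                   'try_to_wake_up', 'update_curr', 'check_preempt_'),
      'slab':     ('kmem_cache_', 'kfree', '__kmalloc', '__cache_free',
                   '____cache_alloc'),
      'timer':    ('mod_timer', '__mod_timer', 'lock_timer_base',
                   'internal_add_timer', 'detach_timer'),
  }

  totals, grand = defaultdict(int), 0
  for line in sys.stdin:
      fields = line.split()
      if len(fields) != 2 or not fields[0].isdigit() or fields[1] == 'total':
          continue
      count, sym = int(fields[0]), fields[1]
      grand += count
      for bucket, prefixes in BUCKETS.items():
          if sym.startswith(prefixes):
              totals[bucket] += count
              break
      else:
          totals['left'] += count

  for bucket, count in sorted(totals.items(), key=lambda kv: -kv[1]):
      print('%-10s overhead (%8d/%d): %5.2f%%'
            % (bucket, count, grand, 100.0 * count / grand))

The exact split obviously depends on which symbols one assigns to which 
bucket.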

The scheduler's functions are absolutely flat, and consistent with an 
extreme context-switching rate of 1.35 million per second. The 
scheduler can go up to about 20 million context switches per second on 
this system:

 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0

... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
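
A quick back-of-the-envelope check of that consistency claim, using 
only the numbers quoted above:

  # if ~20 million context switches/sec saturate the scheduler on this
  # box, then the ~1.35 million/sec seen under tbench should cost about
  # 1.35/20 of the CPU - compare with the 'sched' line in the summary.
  observed_ctxsw = 1.35e6     # tbench context-switch rate (vmstat 'cs')
  max_ctxsw      = 20.0e6     # scheduler saturation rate measured above
  measured_share = 0.0741     # 'sched' overhead from the profile summary
  expected_share = observed_ctxsw / max_ctxsw
  print('expected ~%.2f%%, measured %.2f%%'
        % (100 * expected_share, 100 * measured_share))
  # prints: expected ~6.75%, measured 7.41%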

Wakeup affinities and data flow caching are just fine in this workload 
- we've got scheduler statistics for that and they look good too.

It all looks like pure old-fashioned straight overhead in the 
networking layer to me. Do we still touch the same global cacheline 
for every localhost packet we process? Anything like that would show 
up big time.

Anyway, in terms of scheduling there's absolutely nothing anomalous i 
can see about this workload. Scheduling looks healthy throughout - and 
the few things we noticed causing unnecessary overhead are now fixed 
in -rc5. (but all of that is below 5% of the total scheduling 
overhead - i.e. in the 0.4% absolute range in this workload)

And the thing is, the scheduler's task in this workload is by far the 
most difficult one conceptually: it has to manage and optimize 
concurrency of _future_ processing, with an event frequency that is 
_WAY_ out of the normal patterns: more than 1.3 million context 
switches per second (!). It also switches to/from completely 
independent contexts of computing, with all the implications that 
this brings.

Networking and VFS "just" have to shuffle around bits in memory along a 
very specific plan given to it by user-space. That plan is 
well-specified and goes along the lines of: "copy this (already 
cached) file content to that socket" and back.
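
For illustration only - not the actual tbench client, just a minimal 
sketch (with a made-up port number) of the kind of localhost 
send/receive ping-pong that such a plan boils down to:

  #!/usr/bin/env python
  # minimal loopback send/recv ping-pong - an illustration of the
  # workload pattern, not the real tbench client or protocol.
  import os, socket

  HOST, PORT = '127.0.0.1', 7003          # arbitrary port for the sketch
  MSG = b'x' * 4096                       # pretend "file content" payload

  def recv_exactly(sock, n):
      # recv() may return partial chunks, so loop until n bytes arrive
      buf = b''
      while len(buf) < n:
          chunk = sock.recv(n - len(buf))
          if not chunk:
              break
          buf += chunk
      return buf

  srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  srv.bind((HOST, PORT))
  srv.listen(1)

  if os.fork() == 0:
      # child: echo-style server, bounces a fixed-size reply per request
      conn, _ = srv.accept()
      while recv_exactly(conn, len(MSG)):
          conn.sendall(MSG)
      os._exit(0)

  srv.close()
  cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  cli.connect((HOST, PORT))
  for _ in range(100000):                 # hammer the loopback path
      cli.sendall(MSG)
      recv_exactly(cli, len(MSG))
  cli.close()
  os.wait()

Every iteration pushes data through the full local TCP stack in both 
directions, which is the path the profile above is dominated by.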

By the raw throughput figures the system is pushing a couple of 
million data packets per second.

Still we spend 7 times more CPU time in the networking code than in 
the scheduler or in the user-copy code. Why?

	Ingo

------------------------->
  21570252 total 
  ........
  1494803 copy_user_generic_string 
  998232 sock_rfree 
  491471 tcp_ack 
  482405 ip_dont_fragment 
  470685 ip_local_deliver 
  436325 constant_test_bit 
  375469 avc_has_perm_noaudit 
  347663 tcp_sendmsg 
  310383 tcp_recvmsg 
  300412 __inet_lookup_established 
  294377 system_call 
  286603 tcp_transmit_skb 
  251782 selinux_ip_postroute 
  236028 tcp_current_mss 
  235631 schedule 
  234013 netif_rx 
  229854 _local_bh_enable_ip 
  219501 tcp_v4_rcv 
  210046 netlbl_enabled 
  205022 constant_test_bit 
  199598 skb_release_head_state 
  187952 ip_queue_xmit 
  178779 tcp_established_options 
  175955 dev_queue_xmit 
  169904 netif_receive_skb 
  166629 ip_finish_output2 
  162291 sysret_check 
  151262 __switch_to 
  143355 audit_syscall_entry 
  142694 load_cr3 
  136571 memset_c 
  136115 nf_hook_slow 
  130825 ip_local_deliver_finish 
  128795 ip_rcv 
  125995 selinux_socket_sock_rcv_skb 
  123944 net_rx_action 
  123100 __copy_skb_header 
  122052 __inet_lookup 
  121744 constant_test_bit 
  119444 get_page_from_freelist 
  116486 avc_has_perm 
  115643 audit_syscall_exit 
  115123 find_pid_ns 
  114483 tcp_cleanup_rbuf 
  111350 tcp_rcv_established 
  109853 __mod_timer 
  107891 lock_sock_nested 
  107316 napi_disable_pending 
  106581 release_sock 
  104402 skb_copy_datagram_iovec 
  101591 __tcp_push_pending_frames 
  101206 tcp_event_data_recv 
   98046 kmem_cache_alloc_node
   97982 tcp_v4_do_rcv
   92714 sys_recvfrom
   91551 rb_erase
   89730 kfree
   87979 ip_rcv_finish
   87166 compare_ether_addr
   86982 selinux_parse_skb
   86731 nf_iterate
   79690 selinux_ipv4_output
   79347 __cache_free
   78992 audit_free_names
   78127 skb_release_data
   77501 mod_timer
   77241 __sock_recvmsg
   77228 sock_recvmsg
   77211 ____cache_alloc
   76495 tcp_rcv_space_adjust
   75283 sk_wait_data
   71772 sys_sendto
   71594 sched_clock
   70880 eth_type_trans
   70238 memcpy_toiovec
   69193 do_softirq
   68341 __update_sched_clock
   67597 tcp_v4_md5_lookup
   67424 try_to_wake_up
   64465 sock_common_recvmsg
   64116 put_prev_task_fair
   63964 process_backlog
   62216 __do_softirq
   62093 tcp_cwnd_validate
   61128 __alloc_skb
   60588 put_page
   59536 dput
   58411 __ip_local_out
   56349 avc_audit
   55626 __napi_schedule
   55525 selinux_ipv4_postroute
   54499 __enqueue_entity
   53599 local_bh_disable
   53418 unroll_tree_refs
   53162 __unlazy_fpu
   53084 cfs_rq_of
   52475 set_next_entity
   51108 thread_return
   50458 ip_output
   50268 sched_clock_cpu
   49974 tcp_send_delayed_ack
   49736 ip_finish_output
   49670 finish_task_switch
   49070 ___swab16
   48499 audit_get_context
   48347 raw_local_deliver
   47824 tcp_rtt_estimator
   46707 tcp_push
   46405 constant_test_bit
   45859 select_task_rq_fair
   45188 math_state_restore
   44889 check_preempt_wakeup
   44449 task_rq_lock
   43704 sel_netif_sid
   43377 sock_sendmsg
   42612 sk_reset_timer
   42606 __skb_clone
   42223 __find_general_cachep
   41950 selinux_socket_sendmsg
   41716 constant_test_bit
   41097 skb_push
   40723 lock_sock
   40715 system_call_after_swapgs
   40399 selinux_netlbl_inode_permission
   40179 rb_insert_color
   40021 __kfree_skb
   40015 sockfd_lookup_light
   39216 internal_add_timer
   39024 skb_can_coalesce
   38838 __tcp_select_window
   38651 current_kernel_time
   38533 tcp_v4_md5_do_lookup
   38372 __sock_sendmsg
   38162 selinux_socket_recvmsg
   37812 sel_netport_sid
   37727 account_group_exec_runtime
   37695 switch_mm
   36247 nf_hook_thresh
   36057 auditsys
   35266 pick_next_task_fair
   35064 __tcp_ack_snd_check
   35052 sock_def_readable
   34826 sysret_careful
   34578 _local_bh_enable
   34498 free_hot_cold_page
   34338 kmap
   34028 loopback_xmit
   33320 sk_stream_alloc_skb
   33269 test_ti_thread_flag
   33219 skb_fill_page_desc
   33049 tcp_is_cwnd_limited
   33012 update_min_vruntime
   32431 native_read_tsc
   32398 dst_release
   31661 get_pageblock_flags_group
   31652 path_put
   31516 tcp_push_pending_frames
   31265 netif_needs_gso
   31175 constant_test_bit
   31077 __cycles_2_ns
   30971 socket_has_perm
   30893 __phys_addr
   30867 lock_timer_base
   30585 __wake_up
   30456 ret_from_sys_call
   30147 skb_release_all
   29356 local_bh_enable
   29334 __skb_insert
   28681 tcp_cwnd_test
   28652 __skb_dequeue
   28612 prepare_to_wait
   28268 kmem_cache_free
   28193 set_bit
   28149 dequeue_task_fair
   27906 skb_header_pointer
   27861 sys_kill
   27803 selinux_task_kill
   27627 audit_free_aux
   27600 selinux_netlbl_sock_rcv_skb
   26794 update_curr
   26777 __alloc_pages_internal
   26469 skb_entail
   26458 pskb_may_pull
   26216 inet_ehashfn
   26075 call_softirq
   26033 copy_from_user
   25933 __local_bh_disable
   25666 fget_light
   25270 inet_csk_reset_xmit_timer
   25071 signal_pending_state
   24117 tcp_init_tso_segs
   24109 TCP_ECN_check_ce
   23702 nf_hook_thresh
   23558 copy_to_user
   23426 sysret_audit
   23267 sk_wake_async
   22627 tcp_options_write
   22174 netif_tx_queue_stopped
   21795 tcp_prequeue_process
   21757 tcp_set_skb_tso_segs
   21579 avc_hash
   21565 ___swab16
   21560 ip_local_out
   21445 sk_wmem_schedule
   21234 get_page
   21200 __wake_up_common
   21042 sel_netnode_find
   20772 sock_put
   20625 schedule_timeout
   20613 __napi_complete
   20563 fput_light
   20532 tcp_bound_to_half_wnd
   19912 cap_task_kill
   19773 sysret_signal
   19374 compound_head
   19121 get_seconds
   19048 PageLRU
   18893 zone_watermark_ok
   18635 tcp_snd_wnd_test
   18634 enqueue_task_fair
   18603 rb_next
   18598 next_zones_zonelist
   18534 resched_task
   17820 hash_64
   17801 autoremove_wake_function
   17451 __skb_queue_before
   17283 native_load_tls
   17227 __skb_dequeue
   17149 xfrm4_policy_check
   16942 zone_statistics
   16886 skb_reset_network_header
   16824 ___swab16
   16725 pskb_may_pull
   16645 dev_hard_start_xmit
   16580 sk_filter
   16523 tcp_ca_event
   16479 tcp_win_from_space
   16408 tcp_parse_aligned_timestamp
   16204 finish_wait
   16124 virt_to_slab
   15965 tcp_v4_send_check
   15920 skb_reset_transport_header
   15867 tcp_data_snd_check
   15819 security_sock_rcv_skb
   15665 tcp_ack_saw_tstamp
   15621 skb_network_offset
   15568 virt_to_head_page
   15553 dst_confirm
   15320 skb_pull
   15277 clear_bit
   15179 alloc_pages_current
   14991 bictcp_acked
   14743 tcp_store_ts_recent
   14660 sel_netnode_sid
   14650 __xchg
   14573 task_has_perm
   14561 tcp_v4_check
   14492 net_invalid_timestamp
   14485 security_socket_recvmsg
   14363 __dequeue_entity
   14318 pid_nr_ns
   14311 device_not_available
   14212 local_bh_enable_ip
   14092 virt_to_cache
   13804 netpoll_rx
   13781 fcheck_files
   13724 tcp_adjust_fackets_out
   13717 net_timestamp
   13638 ___swab16
   13576 sel_netport_find
   13563 __kmalloc_node
   13530 __inc_zone_state
   13215 pid_vnr
   13208 free_pages_check
   13008 security_socket_sendmsg
   12971 ip_skb_dst_mtu
   12827 __cpu_set
   12782 bictcp_cong_avoid
   12779 test_tsk_thread_flag
   12734 wakeup_preempt_entity
   12651 sel_netif_find
   12545 skb_set_owner_r
   12534 skb_headroom
   12348 tcp_event_new_data_sent
   12251 place_entity
   12047 set_bit
   11805 update_rq_clock
   11788 detach_timer
   11659 policy_zonelist
   11423 skb_clone
   11380 __skb_queue_tail
   11249 dequeue_task
   10823 init_rootdomain
   10690 __cpu_clear
   10558 default_wake_function
   10556 tcp_rcv_rtt_measure_ts
   10451 PageSlab
   10427 sock_wfree
   10277 calc_delta_fair
   10237 tcp_validate_incoming
   10218 task_rq_unlock
   10023 page_get_cache

