Message-ID: <49F09F84.10108@cosmosbay.com>
Date: Thu, 23 Apr 2009 19:04:04 +0200
From: Eric Dumazet <dada1@...mosbay.com>
To: Christoph Lameter <cl@...ux.com>
CC: David Miller <davem@...emloft.net>,
Michael Chan <mchan@...adcom.com>,
Ben Hutchings <bhutchings@...arflare.com>,
netdev@...r.kernel.org
Subject: Re: udp ping pong measurements from 2.6.22 to .30 with various cpu
affinities
Christoph Lameter wrote:
> On Wed, 22 Apr 2009, Eric Dumazet wrote:
>
>> Check /proc/cpuinfo, and check it doesn't change between kernel versions.
>
> Hmmm.... It does not since it depends on the way that the machine is
> configured by the firmware.
>
>>> Results follow (a nice diagram is available from
>>> http://gentwo.org/results/udpping-tests-2.pdf
>> Nice graphs, but lack of documentation of test conditions.
>
> What would you like to know?
For example, "cat /proc/cpuinfo", just to be sure you did not invert cpus,
since your results are not that logical.
"size vmlinux-2.6.2[2-9] vmlinux-2.6.30" to check how kernel text size increase.
We add new features, security knobs, fixes, ... and this probably increase number
of instructions taken in a particular workload, who knows ?
What was the size of packets used on this last test ?
iptables active or not ? conntrack ?
Overall, there are many unknown factors, which means we have to guess what
is really happening.
>
>> We already pointed out it was probably scheduling, since ICMP pings don't use
>> processes and show no regression here. Patching the kernel to implement udpping
>> at softirq level should be quite easy if you really want to check the UDP stack.
>>
>> Last network improvements focused on scalability more than latencies.
>> (multi-flows, not single flow !)
>
> Latencies have priority here. Multi-flows are secondary.
Amen :)
>
> So I guess this means that you are okay with the network stack's latency
> creep? Even if it's only 1.5 usec: in practice you tune the NIC to perform
> much better, and this is a latency increase likely occurring in every
> packet transmission.
You keep mentioning "network latency" increases, while your test obviously
stresses the network stack, the scheduler, the NIC driver, the memory allocator,
the time infrastructure, and many other kernel services.
1.5 us is about 4500 cycles at 3 GHz, i.e. ~1000 instructions or so, not counting
hardware issues. That would not be a surprise given the vmlinux text size growth.
The number of cycles per udpping round trip can also be estimated with oprofile
(one NMI every 10000 CPU_CLK_UNHALTED events):
<target># egrep "NMI|eth1" /proc/interrupts
34: 6371754 0 0 0 0 0 0 0 PCI-MSI-edge eth1
NMI: 8831219 141645 96321 113255 26232 44575 64762 23292 Non-maskable interrupts
<source># udpping -n 100000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ....
10286138.24us (93.16us/102.86us/10896.49us)
<target># egrep "Non-maskable interrupts|eth1" /proc/interrupts
34: 6581203 0 0 0 0 0 0 0 PCI-MSI-edge eth1
NMI: 9167306 150313 97909 119027 31965 46523 68006 25070 Non-maskable interrupts
delta of eth1 irq = 6581203 - 6371754 = 209449 (2*100000 + residual traffic on this host)
and delta of NMI on CPU0 = 336087
-> 336087*10000 CPU_CLK_UNHALTED / 100000 packets
-> 33608 cycles per packet (only on ring 0 : kernel)
Shouldn't it be about 10 us ?
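Redoing that back-of-envelope in C (values copied from the counters above, clock
speed taken from the opreport header below; just the same arithmetic, spelled out):

#include <stdio.h>

int main(void)
{
	long nmi_delta = 9167306 - 8831219;	/* NMI delta on CPU0 during the run */
	long cycles_per_nmi = 10000;		/* one NMI every 10000 CPU_CLK_UNHALTED */
	long packets = 100000;			/* udpping -n 100000 round trips */
	double mhz = 3000.36;			/* CPU clock from the opreport header */
	double cycles = (double)nmi_delta * cycles_per_nmi / packets;

	printf("%.0f cycles/round trip, %.1f us in ring 0\n", cycles, cycles / mhz);
	/* prints: 33609 cycles/round trip, 11.2 us in ring 0 */
	return 0;
}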
BTW on my machines, with latest linux-2.6.git and bnx2.disable_msi=1, I get :
# udpping -l 40 -n 100000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ...
7891442.22us (72.22us/78.91us/501.14us)
With bnx2.disable_msi=0 :
# udpping -l 40 -n 100000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ....
7644318.91us (66.90us/76.44us/4036.34us)
So it's quite a bit less than your numbers (and I do use netfilter, bonding, vlan...).
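For reference, this is NOT the udpping source (I don't have Christoph's tool at hand),
just a minimal C sketch of the same kind of loop, assuming a process on the target
echoing every datagram received on port 9000 back to its sender:

/* Minimal udpping-like client: send a 40 byte datagram, block until the
 * echo comes back, accumulate round trip times. Port 9000 and the output
 * format are assumptions mimicking the runs quoted above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(9000) };
	char buf[40];				/* -l 40 payload */
	struct timeval t0, t1;
	double us, total = 0.0, min = 1e9, max = 0.0;
	long i, n = 100000;			/* -n 100000 */
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (argc < 2 || fd < 0)
		return 1;
	inet_pton(AF_INET, argv[1], &peer.sin_addr);
	memset(buf, 0, sizeof(buf));

	for (i = 0; i < n; i++) {
		gettimeofday(&t0, NULL);
		sendto(fd, buf, sizeof(buf), 0,
		       (struct sockaddr *)&peer, sizeof(peer));
		recv(fd, buf, sizeof(buf), 0);	/* wait for the echo */
		gettimeofday(&t1, NULL);
		us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
		total += us;
		if (us < min) min = us;
		if (us > max) max = us;
	}
	printf("%.2fus (%.2fus/%.2fus/%.2fus)\n", total, min, total / n, max);
	return 0;
}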
# opreport -la vmlinux-2.6.30-rc3 | head -n 50
CPU: Core 2, speed 3000.36 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
294498 294498 18.5543 18.5543 mwait_idle
63149 357647 3.9786 22.5329 __schedule
50071 407718 3.1546 25.6875 irq_entries_start
49798 457516 3.1374 28.8249 bnx2_poll_work
38037 495553 2.3964 31.2214 apic_timer_interrupt
36896 532449 2.3246 33.5459 sched_clock_cpu
27671 560120 1.7434 35.2893 find_busiest_group
25694 585814 1.6188 36.9081 bnx2_start_xmit
25516 611330 1.6076 38.5157 ipt_do_table
24312 635642 1.5317 40.0474 scheduler_tick
22728 658370 1.4319 41.4794 __switch_to
21672 680042 1.3654 42.8448 bnx2_msi
20957 700999 1.3204 44.1651 native_sched_clock
20156 721155 1.2699 45.4350 read_tsc
18813 739968 1.1853 46.6203 handle_edge_irq
17396 757364 1.0960 47.7163 __skb_recv_datagram
15010 772374 0.9457 48.6620 copy_to_user
14954 787328 0.9421 49.6041 udp_recvmsg
14557 801885 0.9171 50.5213 dev_queue_xmit
13727 815612 0.8648 51.3861 udp_sendmsg
12849 828461 0.8095 52.1956 cpu_idle
12632 841093 0.7959 52.9915 dequeue_task_fair
12145 853238 0.7652 53.7567 __alloc_skb
12135 865373 0.7645 54.5212 __qdisc_run
11528 876901 0.7263 55.2475 ip_append_data
10847 887748 0.6834 55.9309 getnstimeofday
10733 898481 0.6762 56.6071 netif_receive_skb
10598 909079 0.6677 57.2748 sysenter_past_esp
10557 919636 0.6651 57.9399 prepare_to_wait_exclusive
10162 929798 0.6402 58.5802 finish_task_switch
10120 939918 0.6376 59.2178 __do_softirq
9778 949696 0.6160 59.8338 ip_push_pending_frames
9594 959290 0.6045 60.4383 hrtimer_interrupt
9199 968489 0.5796 61.0178 csum_partial_copy_generic
9181 977670 0.5784 61.5963 copy_from_user
8641 986311 0.5444 62.1407 ktime_get_ts
8513 994824 0.5363 62.6770 __udp4_lib_rcv
8479 1003303 0.5342 63.2112 tick_sched_timer
8445 1011748 0.5321 63.7433 tick_check_oneshot_broadcast
8319 1020067 0.5241 64.2674 ip_rcv
8267 1028334 0.5208 64.7882 enqueue_task_fair
8010 1036344 0.5047 65.2929 rb_erase
7937 1044281 0.5001 65.7930 kfree
7923 1052204 0.4992 66.2921 ktime_get
7891 1060095 0.4972 66.7893 bnx2_poll
7702 1067797 0.4853 67.2745 update_curr
7629 1075426 0.4807 67.7552 try_to_wake_up
opannotate extracts :
c0202f40 <irq_entries_start>: /* irq_entries_start total: 55704 3.1467 */
...
55361 3.1273 :c0203000: push $0x35
251 0.0142 :c0203002: jmp c020301a <irq_entries_start+0xda>
...
80 0.0045 :c020301a: jmp c0203340 <common_interrupt>
Using NO_HZ mode gives higher latencies
[root@...voipvnx002 ~]# udpping -n 1000000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 1000000 samples ....
101270858.22us (88.74us/101.27us/18833.35us)
and in this NO_HZ case, the 'network part' is obviously not the biggest consumer.
Using opcontrol --separate=cpu to split events per CPU, here are the CPU0 events only :
CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
Samples on CPU 0
samples cum. samples % cum. % symbol name
34593 34593 7.3643 7.3643 mwait_idle
21077 55670 4.4870 11.8513 __schedule
17648 73318 3.7570 15.6083 irq_entries_start
14354 87672 3.0557 18.6640 bnx2_poll_work
13873 101545 2.9533 21.6174 getnstimeofday
12146 113691 2.5857 24.2031 rb_get_reader_page
11810 125501 2.5142 26.7172 get_next_timer_interrupt
11062 136563 2.3549 29.0722 copy_to_user (udpping recvfrom())
9980 146543 2.1246 31.1968 __switch_to
9011 155554 1.9183 33.1151 sched_clock_cpu
8081 163635 1.7203 34.8354 ring_buffer_consume (oprofile)
8045 171680 1.7127 36.5480 read_tsc
6717 178397 1.4299 37.9780 tick_nohz_stop_sched_tick
6696 185093 1.4255 39.4035 udp_recvmsg
6679 191772 1.4219 40.8253 __skb_recv_datagram
6322 198094 1.3459 42.1712 bnx2_msi
5821 203915 1.2392 43.4104 native_sched_clock
5269 209184 1.1217 44.5321 udp_sendmsg
4584 213768 0.9759 45.5079 dev_queue_xmit
3918 217686 0.8341 46.3420 copy_from_user
3887 221573 0.8275 47.1695 ip_append_data
3838 225411 0.8171 47.9865 sched_clock_tick
3806 229217 0.8102 48.7968 __alloc_skb
3680 232897 0.7834 49.5802 dequeue_task_fair
3635 236532 0.7738 50.3540 rb_erase
3330 239862 0.7089 51.0629 ipt_do_table
3300 243162 0.7025 51.7655 pick_next_task_fair
3259 246421 0.6938 52.4592 __hrtimer_start_range_ns
3164 249585 0.6736 53.1328 tick_check_idle
3107 252692 0.6614 53.7942 bnx2_start_xmit
3088 255780 0.6574 54.4516 prepare_to_wait_exclusive
3046 258826 0.6484 55.1001 sysenter_past_esp
2940 261766 0.6259 55.7260 finish_task_switch
2903 264669 0.6180 56.3440 handle_edge_irq
2848 267517 0.6063 56.9503 netif_receive_skb
2747 270264 0.5848 57.5351 __qdisc_run
2714 272978 0.5778 58.1128 rb_advance_reader
2649 275627 0.5639 58.6768 update_curr
2481 278108 0.5282 59.2049 sys_socketcall
2467 280575 0.5252 59.7301 ktime_get
2396 282971 0.5101 60.2402 kfree
2372 285343 0.5050 60.7451 enqueue_task_fair
Profiling INST_RETIRED_ANY_P (number of instructions retired) instead :
CPU: Core 2, speed 3000.03 MHz (estimated)
Counted INST_RETIRED_ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 10000
Samples on CPU 0
samples cum. samples % cum. % symbol name
45654 45654 12.6972 12.6972 get_next_timer_interrupt (!!!)
14583 60237 4.0558 16.7530 ipt_do_table
11819 72056 3.2871 20.0400 getnstimeofday
9096 81152 2.5298 22.5698 sched_clock_cpu
7534 88686 2.0953 24.6651 bnx2_poll_work
7163 95849 1.9922 26.6573 ring_buffer_consume
7089 102938 1.9716 28.6289 read_tsc
7006 109944 1.9485 30.5774 __schedule
6012 115956 1.6720 32.2494 csum_partial_copy_generic
5870 121826 1.6326 33.8820 rb_get_reader_page
5718 127544 1.5903 35.4722 tick_nohz_stop_sched_tick
5373 132917 1.4943 36.9666 ktime_get
5003 137920 1.3914 38.3580 dev_queue_xmit
4402 142322 1.2243 39.5823 set_normalized_timespec
4332 146654 1.2048 40.7871 clockevents_program_event
4263 150917 1.1856 41.9727 udp_sendmsg
4175 155092 1.1611 43.1338 copy_to_user
4113 159205 1.1439 44.2777 native_sched_clock
3920 163125 1.0902 45.3679 ip_append_data
3710 166835 1.0318 46.3998 tick_check_idle
3541 170376 0.9848 47.3846 handle_edge_irq
3335 173711 0.9275 48.3121 __hrtimer_start_range_ns
3319 177030 0.9231 49.2352 sched_clock_tick
3296 180326 0.9167 50.1519 nf_iterate
3225 183551 0.8969 51.0488 ktime_get_ts
3041 186592 0.8458 51.8945 rb_event_length
2954 189546 0.8216 52.7161 __qdisc_run
2840 192386 0.7899 53.5060 udp_recvmsg
2838 195224 0.7893 54.2952 copy_from_user
2780 198004 0.7732 55.0684 bnx2_start_xmit
2778 200782 0.7726 55.8410 ip_push_pending_frames
2672 203454 0.7431 56.5842 __switch_to
2423 205877 0.6739 57.2580 native_apic_mem_write
2349 208226 0.6533 57.9113 __do_softirq
2344 210570 0.6519 58.5632 net_rx_action
2309 212879 0.6422 59.2054 kfree
2226 215105 0.6191 59.8245 udp_push_pending_frames
2212 217317 0.6152 60.4397 __ip_route_output_key
2081 219398 0.5788 61.0185 __kmalloc_track_caller
2071 221469 0.5760 61.5944 dequeue_task_fair
2068 223537 0.5751 62.1696 ip_finish_output
2044 225581 0.5685 62.7381 __alloc_skb
I believe all these *time* services are really expensive.
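If someone wants a rough feeling for that cost from userspace (clocksource dependent,
and not the same thing as ktime_get()/getnstimeofday() called from softirq context;
purely an illustrative probe, link with -lrt):

/* Rough probe of per-call timestamping cost. The result depends on the
 * selected clocksource (tsc, hpet, acpi_pm, ...) and is only an indication. */
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec a, b, t;
	long i, loops = 1000000;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < loops; i++)
		clock_gettime(CLOCK_MONOTONIC, &t);
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("%.1f ns per clock_gettime() call\n",
	       ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / loops);
	return 0;
}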