Date:	Thu, 23 Apr 2009 19:04:04 +0200
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Christoph Lameter <cl@...ux.com>
CC:	David Miller <davem@...emloft.net>,
	Michael Chan <mchan@...adcom.com>,
	Ben Hutchings <bhutchings@...arflare.com>,
	netdev@...r.kernel.org
Subject: Re: udp ping pong measurements from 2.6.22 to .30 with various cpu
 affinities

Christoph Lameter wrote:
> On Wed, 22 Apr 2009, Eric Dumazet wrote:
> 
>> Check /proc/cpuinfo, and check it doesnt change between kernel version.
> 
> Hmmm.... It does not since it depends on the way that the machine is
> configured by the firmware.
> 
>>> Results follow (a nice diagram is available from
>>> http://gentwo.org/results/udpping-tests-2.pdf
>> Nice graphs, but lack of documentation of test conditions.
> 
> What would you like to know?

For example, "cat /proc/cpuinfo", just to be sure you did not invert cpus,
since your results are not that logical.

"size vmlinux-2.6.2[2-9] vmlinux-2.6.30" to check how kernel text size increase.
We add new features, security knobs, fixes, ... and this probably increase number
of instructions taken in a particular workload, who knows ?

What was the size of the packets used in this last test?

Is iptables active or not? conntrack?
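Something as simple as this, run on the test box, would answer both (module names vary
with kernel version and configuration; /proc/net/ip_conntrack is the older name,
nf_conntrack the newer one):

# lsmod | egrep "ip_tables|iptable_|nf_conntrack|ip_conntrack"
# iptables -L -n -v
# wc -l /proc/net/ip_conntrack

i.e. which netfilter modules are loaded, the rulesets with their packet counters,
and the number of conntrack entries.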

Overall, there are many unknown factors, which means we have to guess what is really
happening.

> 
>> We already pointed it was probably scheduling. Since ICMP pings dont use
>> processes and no regression here. Patching kernel to implement udpping
>> at softirq level should be quite easy if you really want to check UDP stack.
>>
>> Last network improvements focused on scalability more than latencies.
>> (multi-flows, not single flow !)
> 
> Latencies have priority here. Multi-flows are secondary.

Amen :)

> 
> So I guess this means that you are okay with the network stacks latency
> creep? Even if its only 1.5usec: In practice you tune the NIC to perform
> much better and this is a latency increase likely occurring in every
> packet transmission.

You keep mentioning "network latency" increases, while your test obviously
stresses the network stack, scheduler, NIC driver, memory allocator,
time infrastructure, and many other kernel services.

1.5 us is about 4500 cycles -> ~1000 instructions or so, not counting
hardware issues. That would not be a surprise given the vmlinux text size.
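As a sanity check on that figure, assuming the 3.0 GHz Core 2 clock reported by
opreport below:

# echo $(( 15 * 3000 / 10 ))
4500

i.e. 1.5 us is 4500 cycles at 3 GHz, and with an effective IPC somewhere in the
0.2-0.3 range (typical for cache-missing kernel paths) that is indeed on the order
of 1000 instructions.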

The number of cycles per udpping round trip can also be estimated with oprofile
(one NMI every 10000 CPU_CLK_UNHALTED events).
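The oprofile setup for this kind of measurement looks roughly like the following
(illustrative only: the vmlinux path is a placeholder, and --separate=cpu, the long
form of the -p=cpu used further down, only matters for the per-CPU reports):

# opcontrol --deinit
# opcontrol --setup --vmlinux=/path/to/vmlinux --event=CPU_CLK_UNHALTED:10000 --separate=cpu
# opcontrol --start
  ... drive the udpping run from the source host ...
# opcontrol --dump
# opreport -la /path/to/vmlinux | head -n 50

With that armed, the NMI line of /proc/interrupts on the target gives the sample counts: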

<target># egrep "NMI|eth1" /proc/interrupts
 34:    6371754          0          0          0          0          0          0          0   PCI-MSI-edge      eth1
NMI:    8831219     141645      96321     113255      26232      44575      64762      23292   Non-maskable interrupts

<source># udpping -n 100000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ....
10286138.24us (93.16us/102.86us/10896.49us)

<target># egrep "Non-maskable interrupts|eth1" /proc/interrupts
 34:    6581203          0          0          0          0          0          0          0   PCI-MSI-edge      eth1
NMI:    9167306     150313      97909     119027      31965      46523      68006      25070   Non-maskable interrupts

delta of eth1 irqs = 6581203 - 6371754 = 209449   (2*100000 + residual traffic on this host)
and delta of NMIs on CPU0 = 9167306 - 8831219 = 336087

-> 336087*10000 CPU_CLK_UNHALTED / 100000 round trips

-> ~33608 cycles per round trip (ring 0 only: kernel)
Shouldn't it be about 10 us?
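All of that bookkeeping fits in a few lines of shell; a minimal sketch, run on the
target host (it assumes the 10000-cycle NMI period set up above, CPU0 handling the
flow, and placeholder host/sample values):

#!/bin/sh
# Snapshot the CPU0 NMI count, let the source host drive the udpping run,
# snapshot again, and convert the delta to kernel cycles per round trip.
samples=100000
nmi_before=$(awk '/^NMI:/ {print $2}' /proc/interrupts)
echo "run 'udpping -n $samples <target>' from the source host, then press Enter"
read dummy
nmi_after=$(awk '/^NMI:/ {print $2}' /proc/interrupts)
echo "kernel cycles per round trip: $(( (nmi_after - nmi_before) * 10000 / samples ))"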

BTW on my machines, latest linux-2.6.git, with bnx2.disable_msi=1, I have:

# udpping -l 40 -n 100000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ...
7891442.22us (72.22us/78.91us/501.14us)

with bnx2.disable_msi=0:
# udpping -l 40 -n 100000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 100000 samples ....
7644318.91us (66.90us/76.44us/4036.34us)


So it's quite a bit less than your numbers (and I do use netfilter, bonding, vlan...).

# opreport -la vmlinux-2.6.30-rc3 | head -n 50
CPU: Core 2, speed 3000.36 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
294498   294498        18.5543  18.5543    mwait_idle
63149    357647         3.9786  22.5329    __schedule
50071    407718         3.1546  25.6875    irq_entries_start
49798    457516         3.1374  28.8249    bnx2_poll_work
38037    495553         2.3964  31.2214    apic_timer_interrupt
36896    532449         2.3246  33.5459    sched_clock_cpu
27671    560120         1.7434  35.2893    find_busiest_group
25694    585814         1.6188  36.9081    bnx2_start_xmit
25516    611330         1.6076  38.5157    ipt_do_table
24312    635642         1.5317  40.0474    scheduler_tick
22728    658370         1.4319  41.4794    __switch_to
21672    680042         1.3654  42.8448    bnx2_msi
20957    700999         1.3204  44.1651    native_sched_clock
20156    721155         1.2699  45.4350    read_tsc
18813    739968         1.1853  46.6203    handle_edge_irq
17396    757364         1.0960  47.7163    __skb_recv_datagram
15010    772374         0.9457  48.6620    copy_to_user
14954    787328         0.9421  49.6041    udp_recvmsg
14557    801885         0.9171  50.5213    dev_queue_xmit
13727    815612         0.8648  51.3861    udp_sendmsg
12849    828461         0.8095  52.1956    cpu_idle
12632    841093         0.7959  52.9915    dequeue_task_fair
12145    853238         0.7652  53.7567    __alloc_skb
12135    865373         0.7645  54.5212    __qdisc_run
11528    876901         0.7263  55.2475    ip_append_data
10847    887748         0.6834  55.9309    getnstimeofday
10733    898481         0.6762  56.6071    netif_receive_skb
10598    909079         0.6677  57.2748    sysenter_past_esp
10557    919636         0.6651  57.9399    prepare_to_wait_exclusive
10162    929798         0.6402  58.5802    finish_task_switch
10120    939918         0.6376  59.2178    __do_softirq
9778     949696         0.6160  59.8338    ip_push_pending_frames
9594     959290         0.6045  60.4383    hrtimer_interrupt
9199     968489         0.5796  61.0178    csum_partial_copy_generic
9181     977670         0.5784  61.5963    copy_from_user
8641     986311         0.5444  62.1407    ktime_get_ts
8513     994824         0.5363  62.6770    __udp4_lib_rcv
8479     1003303        0.5342  63.2112    tick_sched_timer
8445     1011748        0.5321  63.7433    tick_check_oneshot_broadcast
8319     1020067        0.5241  64.2674    ip_rcv
8267     1028334        0.5208  64.7882    enqueue_task_fair
8010     1036344        0.5047  65.2929    rb_erase
7937     1044281        0.5001  65.7930    kfree
7923     1052204        0.4992  66.2921    ktime_get
7891     1060095        0.4972  66.7893    bnx2_poll
7702     1067797        0.4853  67.2745    update_curr
7629     1075426        0.4807  67.7552    try_to_wake_up



opannotate extract:

c0202f40 <irq_entries_start>: /* irq_entries_start total:  55704  3.1467 */
...
 55361  3.1273 :c0203000:       push   $0x35
   251  0.0142 :c0203002:       jmp    c020301a <irq_entries_start+0xda>
...
    80  0.0045 :c020301a:       jmp    c0203340 <common_interrupt>

Using NO_HZ mode gives higher latencies:

[root@...voipvnx002 ~]# udpping -n 1000000 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000 1000000 samples ....
101270858.22us (88.74us/101.27us/18833.35us)

and in this NO_HZ case, the 'network part' is obviously not the most expensive area.
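To pin that down, the obvious follow-up (not run here; the config file path is just
the usual distro convention) would be to disable dynticks and repeat the measurement
on the same kernel image:

# grep NO_HZ /boot/config-$(uname -r)        # confirm the kernel was built with CONFIG_NO_HZ
then reboot with "nohz=off" appended to the kernel command line and rerun
# udpping -n 1000000 192.168.20.110
with everything else unchanged, to compare periodic-tick vs dynticks latencies directly.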



Using opcontrol -p=cpu to separate events for each CPU, here are the CPU0 events only:

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
Samples on CPU 0

samples  cum. samples  %        cum. %     symbol name
34593    34593          7.3643   7.3643    mwait_idle
21077    55670          4.4870  11.8513    __schedule
17648    73318          3.7570  15.6083    irq_entries_start
14354    87672          3.0557  18.6640    bnx2_poll_work
13873    101545         2.9533  21.6174    getnstimeofday
12146    113691         2.5857  24.2031    rb_get_reader_page
11810    125501         2.5142  26.7172    get_next_timer_interrupt
11062    136563         2.3549  29.0722    copy_to_user             (udpping recvfrom())
9980     146543         2.1246  31.1968    __switch_to
9011     155554         1.9183  33.1151    sched_clock_cpu
8081     163635         1.7203  34.8354    ring_buffer_consume    (oprofile)
8045     171680         1.7127  36.5480    read_tsc
6717     178397         1.4299  37.9780    tick_nohz_stop_sched_tick
6696     185093         1.4255  39.4035    udp_recvmsg
6679     191772         1.4219  40.8253    __skb_recv_datagram
6322     198094         1.3459  42.1712    bnx2_msi
5821     203915         1.2392  43.4104    native_sched_clock
5269     209184         1.1217  44.5321    udp_sendmsg
4584     213768         0.9759  45.5079    dev_queue_xmit
3918     217686         0.8341  46.3420    copy_from_user
3887     221573         0.8275  47.1695    ip_append_data
3838     225411         0.8171  47.9865    sched_clock_tick
3806     229217         0.8102  48.7968    __alloc_skb
3680     232897         0.7834  49.5802    dequeue_task_fair
3635     236532         0.7738  50.3540    rb_erase
3330     239862         0.7089  51.0629    ipt_do_table
3300     243162         0.7025  51.7655    pick_next_task_fair
3259     246421         0.6938  52.4592    __hrtimer_start_range_ns
3164     249585         0.6736  53.1328    tick_check_idle
3107     252692         0.6614  53.7942    bnx2_start_xmit
3088     255780         0.6574  54.4516    prepare_to_wait_exclusive
3046     258826         0.6484  55.1001    sysenter_past_esp
2940     261766         0.6259  55.7260    finish_task_switch
2903     264669         0.6180  56.3440    handle_edge_irq
2848     267517         0.6063  56.9503    netif_receive_skb
2747     270264         0.5848  57.5351    __qdisc_run
2714     272978         0.5778  58.1128    rb_advance_reader
2649     275627         0.5639  58.6768    update_curr
2481     278108         0.5282  59.2049    sys_socketcall
2467     280575         0.5252  59.7301    ktime_get
2396     282971         0.5101  60.2402    kfree
2372     285343         0.5050  60.7451    enqueue_task_fair

When oprofiling with INST_RETIRED_ANY_P (number of instructions retired) instead:

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted INST_RETIRED_ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 10000
Samples on CPU 0
samples  cum. samples  %        cum. %     symbol name
45654    45654         12.6972  12.6972    get_next_timer_interrupt  (!!!)
14583    60237          4.0558  16.7530    ipt_do_table
11819    72056          3.2871  20.0400    getnstimeofday
9096     81152          2.5298  22.5698    sched_clock_cpu
7534     88686          2.0953  24.6651    bnx2_poll_work
7163     95849          1.9922  26.6573    ring_buffer_consume
7089     102938         1.9716  28.6289    read_tsc
7006     109944         1.9485  30.5774    __schedule
6012     115956         1.6720  32.2494    csum_partial_copy_generic
5870     121826         1.6326  33.8820    rb_get_reader_page
5718     127544         1.5903  35.4722    tick_nohz_stop_sched_tick
5373     132917         1.4943  36.9666    ktime_get
5003     137920         1.3914  38.3580    dev_queue_xmit
4402     142322         1.2243  39.5823    set_normalized_timespec
4332     146654         1.2048  40.7871    clockevents_program_event
4263     150917         1.1856  41.9727    udp_sendmsg
4175     155092         1.1611  43.1338    copy_to_user
4113     159205         1.1439  44.2777    native_sched_clock
3920     163125         1.0902  45.3679    ip_append_data
3710     166835         1.0318  46.3998    tick_check_idle
3541     170376         0.9848  47.3846    handle_edge_irq
3335     173711         0.9275  48.3121    __hrtimer_start_range_ns
3319     177030         0.9231  49.2352    sched_clock_tick
3296     180326         0.9167  50.1519    nf_iterate
3225     183551         0.8969  51.0488    ktime_get_ts
3041     186592         0.8458  51.8945    rb_event_length
2954     189546         0.8216  52.7161    __qdisc_run
2840     192386         0.7899  53.5060    udp_recvmsg
2838     195224         0.7893  54.2952    copy_from_user
2780     198004         0.7732  55.0684    bnx2_start_xmit
2778     200782         0.7726  55.8410    ip_push_pending_frames
2672     203454         0.7431  56.5842    __switch_to
2423     205877         0.6739  57.2580    native_apic_mem_write
2349     208226         0.6533  57.9113    __do_softirq
2344     210570         0.6519  58.5632    net_rx_action
2309     212879         0.6422  59.2054    kfree
2226     215105         0.6191  59.8245    udp_push_pending_frames
2212     217317         0.6152  60.4397    __ip_route_output_key
2081     219398         0.5788  61.0185    __kmalloc_track_caller
2071     221469         0.5760  61.5944    dequeue_task_fair
2068     223537         0.5751  62.1696    ip_finish_output
2044     225581         0.5685  62.7381    __alloc_skb

I believe all these *time* services are really expensive.
