Message-ID: <4A450C13.9050907@itcare.pl>
Date: Fri, 26 Jun 2009 19:57:39 +0200
From: Paweł Staszewski <pstaszewski@...are.pl>
To: Eric Dumazet <dada1@...mosbay.com>
CC: Jarek Poplawski <jarkao2@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>,
Linux Network Development list <netdev@...r.kernel.org>
Subject: Re: weird problem
Paweł Staszewski writes:
> Eric Dumazet writes:
>> Jarek Poplawski wrote:
>>
>>> On Fri, Jun 26, 2009 at 08:37:19AM +0000, Jarek Poplawski wrote:
>>>
>>>> On 25-06-2009 22:18, Eric Dumazet wrote:
>>>>
>>>>> Paweł Staszewski wrote:
>>>>>
>>>>>> Ok
>>>>>>
>>>>>> After a day of observation I am nearly 100% sure that this CPU
>>>>>> load is caused by route cache flushes.
>>>>>> When the route cache grows to (or near) its
>>>>>> "net.ipv4.route.gc_thresh" size, the system starts to drop some
>>>>>> routes from the cache, and CPU load jumps from 2% to nearly 80%.
>>>>>> After cleaning/flushing the cache, while it is refilling, CPU
>>>>>> load is back to the normal 2%.
>>>>>>
>>>>>> Does anyone know how to resolve this?
>>>>>> On kernels < 2.6.29 I don't see this; it all started after the
>>>>>> upgrade from 2.6.28 to 2.6.29. I then tried 2.6.29.1, 2.6.29.3
>>>>>> and 2.6.30, and on all these kernels >= 2.6.29 the problem with
>>>>>> CPU load is the same.
>>>>>>
>>>>>> I can minimize these CPU fluctuations by changing the route cache
>>>>>> /proc parameters, but the best result I got for my router was:
>>>>>>
>>>>>> 15 sec of 2% cpu
>>>>>> and after
>>>>>> 15 sec of 80% cpu
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Pawel Staszewski
>>>>>>
>>>>> I believe these are known 2.6.29 regressions.
>>>>>
>>>>> The following two commits should correct the problem you have.
>>>>>
>>>>> Your best bet would be to try 2.6.31-rc1 and tell us whether this
>>>>> recent kernel is OK on your machine.
>>>>>
>>>>>
>>>> Btw., the first of these commits is in 2.6.30, which according to
>>>> Pawel's report still shows the problem.
>>>>
>>> And the second as well.
>>>
>>>
>>
>> Thanks Jarek.
>>
>> Pawel reported some errors in the fib thread, so I am not sure he
>> really tried 2.6.30 and got the same oprofile results.
>>
>> rt_worker_func() taking 13% of cpu0 is an alarm for me :)
>> And 21% of cpu0 and 34% of cpu6 taken by oprofiled seems odd too...
>>
oprofile from:
Linux TM_01_C1 2.6.29.5 #1 SMP Fri Jun 26 19:11:30 UTC 2009 x86_64
Intel(R) Xeon(R) CPU X5450 @ 3.00GHz GenuineIntel GNU/Linux
CPU: Core 2, speed 3000.21 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples %      samples %      samples %      samples %      samples %      samples %      samples %      samples %      image name   app name   symbol name
 (CPU0)         (CPU1)         (CPU2)         (CPU3)         (CPU4)         (CPU5)         (CPU6)         (CPU7)
9999 19.0926   0 0            0 0            0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    rt_worker_func
9017 17.2175   219 0.0277     13 0.0017      29 0.0814      26 0.3410      12 0.1420      13 0.1718      115 1.7649     vmlinux      vmlinux    free_block
8867 16.9311   13588 1.7177   11888 1.5423   9000 25.2731   801 10.5063    461 5.4550     788 10.4136    476 7.3051     vmlinux      vmlinux    mwait_idle
5714 10.9106   197 0.0249     143 0.0186     0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    dst_destroy
2917 5.5699    95 0.0120      76 0.0099      33 0.0927      3 0.0393       0 0            2 0.0264       0 0            vmlinux      vmlinux    __call_rcu
2293 4.3784    74 0.0094      49 0.0064      48 0.1348      24 0.3148      12 0.1420      22 0.2907      12 0.1842      vmlinux      vmlinux    __rcu_process_callbacks
1816 3.4676    24921 3.1504   22351 2.8997   1090 3.0609    443 5.8106     297 3.5144     414 5.4711     382 5.8625     vmlinux      vmlinux    _raw_spin_lock
1055 2.0145    24 0.0030      21 0.0027      0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    dst_rcu_free
669 1.2774     752 0.0951     717 0.0930     3223 9.0506    679 8.9061     1527 18.0689   232 3.0659     517 7.9343     libc-2.8.so  libc-2.8.so  (no symbols)
590 1.1266     1745 0.2206    1208 0.1567    83 0.2331      4 0.0525       5 0.0592       6 0.0793       4 0.0614       vmlinux      vmlinux    kmem_cache_free
568 1.0846     36 0.0046      29 0.0038      0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    ipv4_dst_destroy
534 1.0196     583 0.0737     641 0.0832     691 1.9404     1402 18.3893   542 6.4134     934 12.3431    736 11.2953    vmlinux      vmlinux    tg_shares_up
457 0.8726     20 0.0025      19 0.0025      0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    call_rcu_bh
422 0.8058     0 0            429 0.0557     0 0            0 0            845 9.9988     0 0            0 0            bgpd         bgpd       bgp_best_selection
397 0.7581     0 0            585 0.0759     0 0            153 2.0068     917 10.8508    0 0            0 0            bgpd         bgpd       bgp_route_next
339 0.6473     3085 0.3900    3301 0.4283    169 0.4746     20 0.2623      19 0.2248      36 0.4757      31 0.4758      vmlinux      vmlinux    _raw_spin_unlock
319 0.6091     3645 0.4608    3122 0.4050    0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    local_bh_enable_ip
290 0.5537     0 0            40 0.0052      0 0            0 0            0 0            0 0            0 0            vmlinux      vmlinux    e1000e_update_stats
271 0.5175     152 0.0192     146 0.0189     371 1.0418     626 8.2109     192 2.2719     334 4.4139     165 2.5322     vmlinux      vmlinux    find_next_bit
228 0.4354     19 0.0024      7 9.1e-04      5 0.0140       15 0.1967      12 0.1420      10 0.1322      20 0.3069      vmlinux      vmlinux    rcu_process_callbacks
217 0.4144     0 0            352 0.0457     0 0            84 1.1018      534 6.3188     0 0            0 0            bgpd         bgpd       bgp_scan_timer
203 0.3876     694 0.0877     521 0.0676     6 0.0168       2 0.0262       0 0            2 0.0264       0 0            vmlinux      vmlinux    __phys_addr
191 0.3647     182 0.0230     66 0.0086      208 0.5841     608 7.9748     259 3.0647     268 3.5417     134 2.0565     vmlinux      vmlinux    find_busiest_group
186 0.3552     5221 0.6600    4432 0.5750    116 0.3257     131 1.7183     90 1.0650      122 1.6123     106 1.6268     vmlinux
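
For anyone wanting to reproduce a per-CPU listing like the one above, a
rough sketch of the oprofile invocation (assuming oprofile 0.9.x with the
legacy opcontrol interface; the vmlinux path and sampling window are
illustrative, not from this thread):

    # per-CPU samples for the running kernel, CPU_CLK_UNHALTED @ 100000
    opcontrol --init
    opcontrol --vmlinux=/usr/src/linux/vmlinux --separate=cpu \
              --event=CPU_CLK_UNHALTED:100000
    opcontrol --reset
    opcontrol --start
    sleep 30                  # capture at least one flush/refill cycle
    opcontrol --stop
    opreport --symbols | head -60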
>> Pawel, could you give us:
>>
>> grep . /proc/sys/net/ipv4/route/*
>> cat /proc/interrupts
>>
>> on your various kernels (previous to 2.6.29, 2.6.29, 2.6.30, ...)
>>
>> I suspect a change in hash table size, and/or change in interrupt
>> affinities...
>>
>>
>>
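
A minimal sketch of a collection script for the data requested above,
handy when comparing several kernels (the file name and layout are just a
suggestion, not from the thread):

    #!/bin/sh
    # collect route-cache tunables and interrupt distribution,
    # tagged with the running kernel version
    out="route-diag-$(uname -r).txt"
    {
        uname -a
        echo '--- /proc/sys/net/ipv4/route ---'
        grep . /proc/sys/net/ipv4/route/*
        echo '--- /proc/interrupts ---'
        cat /proc/interrupts
        echo '--- route cache size from dmesg ---'
        dmesg | grep route
    } > "$out"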
> first machine:
> Linux TM_01_C1 2.6.29.5 #1 SMP Fri Jun 26 19:11:30 UTC 2009 x86_64
> Intel(R) Xeon(R) CPU X5450 @ 3.00GHz GenuineIntel GNU/Linux
>
> grep . /proc/sys/net/ipv4/route/*
> /proc/sys/net/ipv4/route/error_burst:1250
> /proc/sys/net/ipv4/route/error_cost:250
> /proc/sys/net/ipv4/route/gc_elasticity:4
> /proc/sys/net/ipv4/route/gc_interval:1
> /proc/sys/net/ipv4/route/gc_min_interval:0
> /proc/sys/net/ipv4/route/gc_min_interval_ms:0
> /proc/sys/net/ipv4/route/gc_thresh:190536
> /proc/sys/net/ipv4/route/gc_timeout:15
> /proc/sys/net/ipv4/route/max_size:524288
> /proc/sys/net/ipv4/route/min_adv_mss:256
> /proc/sys/net/ipv4/route/min_pmtu:552
> /proc/sys/net/ipv4/route/mtu_expires:600
> /proc/sys/net/ipv4/route/redirect_load:5
> /proc/sys/net/ipv4/route/redirect_number:9
> /proc/sys/net/ipv4/route/redirect_silence:5120
> /proc/sys/net/ipv4/route/secret_interval:3600
>
> dmesg | grep route
> IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
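
One observation on the numbers above: with 262144 hash buckets and
gc_thresh at 190536, garbage collection starts while the table still
averages under one entry per bucket (and gc_elasticity 4 would tolerate
much longer chains). A quick check of the ratio, purely illustrative:

    # average entries per hash bucket when gc_thresh is reached
    awk 'BEGIN { printf "%.2f entries/bucket\n", 190536 / 262144 }'
    # prints: 0.73 entries/bucket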
>
>
> cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:         43          0          0          1          1          2          0          0   IO-APIC-edge      timer
>   1:          0          0          0          1          0          0          0          1   IO-APIC-edge      i8042
>   9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
>  14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide0
>  15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
>  29:    1139988   18351004      89662          3          0          1          0          3   PCI-MSI-edge      eth0
>  30:          0          2   20221692          1          0          3          0          0   PCI-MSI-edge      eth1
>  31:          0          1          1          0          0          0          0          0   PCI-MSI-edge
>  32:          0          0          0          0          0          0          2          0   PCI-MSI-edge
>  33:          1          1          0          0          0          0          0          0   PCI-MSI-edge
>  34:          0          0          0          1          0          1          0          0   PCI-MSI-edge
>  35:          0          0          0          1          0          0          0          1   PCI-MSI-edge
>  36:          0          0          0          0          1          0          0          1   PCI-MSI-edge
>  37:          1          0          0          0          0          1          0          0   PCI-MSI-edge
>  38:          0          0          1          0          1          0          0          0   PCI-MSI-edge
>  39:          0          0          2          0          0          0          0          0   PCI-MSI-edge
>  40:          0          0          0          0          0          0          2          0   PCI-MSI-edge
>  41:          0          2          0          0          0          0          0          0   PCI-MSI-edge
>  42:          0          0          0          0          0          2          0          0   PCI-MSI-edge
>  43:          0          0          0          2          0          0          0          0   PCI-MSI-edge
>  44:          0          0          0          0          0          0          0          2   PCI-MSI-edge
>  45:          2          0          0          0          0          0          0          0   PCI-MSI-edge
>  46:          0          0          0          0          2          0          0          0   PCI-MSI-edge
>  48:        233        200        185        257        256        260        269        257   PCI-MSI-edge      ahci
>  49:          0          1          1          0          0          2          1          0   PCI-MSI-edge      ioat-msi
> NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
> LOC:    1191321   26059516   25803111      64841      32718      26651      54058      24166   Local timer interrupts
> RES:        921         59         58         20         14          8         10         13   Rescheduling interrupts
> CAL:         20         85         88         87         90         90         91         86   Function call interrupts
> TLB:        103        116        937        954         95        115       1006       1020   TLB shootdowns
> SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
> ERR:          0
> MIS:          0
>
>
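
Regarding the suspected interrupt affinities: on this box eth0 (IRQ 29)
is spread over CPU0/CPU1 while eth1 (IRQ 30) sits on CPU2. A sketch for
inspecting or pinning the masks (IRQ numbers taken from the table above;
the echo value is only an example):

    cat /proc/irq/29/smp_affinity        # current CPU mask for eth0
    cat /proc/irq/30/smp_affinity        # current CPU mask for eth1
    echo 2 > /proc/irq/29/smp_affinity   # e.g. pin eth0 to CPU1 (mask 0x2)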
> second machine:
> Linux TM_02_C1 2.6.30 #1 SMP Thu Jun 25 21:49:58 CEST 2009 i686
> Intel(R) Xeon(R) CPU 3075 @ 2.66GHz GenuineIntel GNU/Linux
>
> cat /proc/interrupts
> CPU0 CPU1
> 0: 182 129 IO-APIC-edge timer
> 1: 1886 1672 IO-APIC-edge i8042
> 6: 1 1 IO-APIC-edge floppy
> 9: 0 0 IO-APIC-fasteoi acpi
> 12: 2 2 IO-APIC-edge i8042
> 14: 0 0 IO-APIC-edge ide0
> 15: 0 0 IO-APIC-edge ide1
> 27: 41793 26401 PCI-MSI-edge ahci
> 28: 13482 11260 PCI-MSI-edge eth2
> 29: 3 1326457765 PCI-MSI-edge eth1
> 30: 1240943198 137973134 PCI-MSI-edge eth0
> NMI: 0 0 Non-maskable interrupts
> LOC: 1607938599 1514565603 Local timer interrupts
> SPU: 0 0 Spurious interrupts
> RES: 1098 1190 Rescheduling interrupts
> CAL: 28 105 Function call interrupts
> TLB: 2886 3055 TLB shootdowns
> ERR: 0
> MIS: 0
>
> grep . /proc/sys/net/ipv4/route/*
> /proc/sys/net/ipv4/route/error_burst:1250
> /proc/sys/net/ipv4/route/error_cost:250
> /proc/sys/net/ipv4/route/gc_elasticity:4
> /proc/sys/net/ipv4/route/gc_interval:1
> /proc/sys/net/ipv4/route/gc_min_interval:0
> /proc/sys/net/ipv4/route/gc_min_interval_ms:0
> /proc/sys/net/ipv4/route/gc_thresh:190536
> /proc/sys/net/ipv4/route/gc_timeout:15
> /proc/sys/net/ipv4/route/max_size:1524288
> /proc/sys/net/ipv4/route/min_adv_mss:256
> /proc/sys/net/ipv4/route/min_pmtu:552
> /proc/sys/net/ipv4/route/mtu_expires:600
> /proc/sys/net/ipv4/route/redirect_load:5
> /proc/sys/net/ipv4/route/redirect_number:9
> /proc/sys/net/ipv4/route/redirect_silence:5120
> /proc/sys/net/ipv4/route/secret_interval:3600
>
>
> dmesg | grep route
> IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
>
> rtstat -k entries -i 1 -c 10
> rt_cache|
> entries|
> 112754|
> 112446|
> 112277|
> 111451|
> 111042|
> 110314|
> 109153|
> 108370|
> 107730|
> 107478|
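
The rtstat figures above are read from /proc/net/stat/rt_cache; a
dependency-free way to watch the same counter is sketched below (assumes
GNU awk for strtonum; the first column is the entry count in hex,
repeated on every per-CPU line):

    while sleep 1; do
        awk 'NR == 2 { printf "rt_cache entries: %d\n", strtonum("0x" $1) }' \
            /proc/net/stat/rt_cache
    done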
>
>
>
>
>> The change in hash table size comes from commit
>> c9503e0fe052020e0294cd07d0ecd982eb7c9177.
>>
>> But as Pawel mentioned "net.ipv4.route.gc_thresh = 190536", I believe
>> his hash table is smaller than 512k entries!
>>
>> Author: Anton Blanchard <anton@...ba.org>
>> Date: Mon Apr 27 05:42:24 2009 -0700
>>
>> ipv4: Limit size of route cache hash table
>>
>> Right now we have no upper limit on the size of the route cache
>> hash table. On a 128GB POWER6 box it ends up as 32MB:
>>
>> IP route cache hash table entries: 4194304 (order: 9, 33554432 bytes)
>>
>> It would be nice to cap this for memory consumption reasons, but a
>> massive hash table also causes a significant spike when measuring
>> OS jitter.
>>
>> With a 32MB hash table and 4 million entries, rt_worker_func is
>> taking 5 ms to complete. On another system with more memory it's
>> taking 14 ms. Even though rt_worker_func does call cond_resched()
>> to limit its impact, in an HPC environment we want to keep all
>> sources of OS jitter to a minimum.
>>
>> With the patch applied we limit the number of entries to 512k, which
>> can still be overridden by using the rhash_entries boot option:
>>
>> IP route cache hash table entries: 524288 (order: 6, 4194304 bytes)
>>
>> With this patch rt_worker_func now takes 0.460 ms on the same system.
>>
>> Signed-off-by: Anton Blanchard <anton@...ba.org>
>> Acked-by: Eric Dumazet <dada1@...mosbay.com>
>> Signed-off-by: David S. Miller <davem@...emloft.net>
>>
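
If the smaller post-2.6.29 hash table turns out to matter here, the
commit text above says it can still be overridden at boot; a hypothetical
grub entry (kernel path and value purely illustrative) could look like:

    # /boot/grub/grub.conf -- ask for a 1M-entry route cache hash table
    kernel /boot/vmlinuz-2.6.30 ro root=/dev/sda1 rhash_entries=1048576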