[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGVrzcbRycBy0w64R9pV=JG6M3aJeARbOnh-xRrumYzzVDgWGQ@mail.gmail.com>
Date: Tue, 4 Mar 2014 17:16:27 -0800
From: Florian Fainelli <f.fainelli@...il.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>,
paulmck <paulmck@...ux.vnet.ibm.com>,
linux-nfs <linux-nfs@...r.kernel.org>,
"trond.myklebust" <trond.myklebust@...marydata.com>,
netdev <netdev@...r.kernel.org>
Subject: Re: RCU stalls when running out of memory on 3.14-rc4 w/ NFS and
kernel threads priorities changed
2014-03-04 17:03 GMT-08:00 Florian Fainelli <f.fainelli@...il.com>:
> 2014-03-04 16:48 GMT-08:00 Eric Dumazet <eric.dumazet@...il.com>:
>> On Tue, 2014-03-04 at 15:55 -0800, Florian Fainelli wrote:
>>> Hi all,
>>>
>>> I am seeing the following RCU stalls messages appearing on an ARMv7
>>> 4xCPUs system running 3.14-rc4:
>>>
>>> [ 42.974327] INFO: rcu_sched detected stalls on CPUs/tasks:
>>> [ 42.979839] (detected by 0, t=2102 jiffies, g=4294967082,
>>> c=4294967081, q=516)
>>> [ 42.987169] INFO: Stall ended before state dump start
>>>
>>> this is happening under the following conditions:
>>>
>>> - the attached bumper.c binary alters various kernel thread priorities
>>> based on the contents of bumpup.cfg and
>>> - malloc_crazy is running from a NFS share
>>> - malloc_crazy.c is running in a loop allocating chunks of memory but
>>> never freeing it
>>>
>>> when the priorities are altered, instead of getting the OOM killer to
>>> be invoked, the RCU stalls are happening. Taking NFS out of the
>>> equation does not allow me to reproduce the problem even with the
>>> priorities altered.
>>>
>>> This "problem" seems to have been there for quite a while now since I
>>> was able to get 3.8.13 to trigger that bug as well, with a slightly
>>> more detailed RCU debugging trace which points the finger at kswapd0.
>>>
>>> You should be able to get that reproduced under QEMU with the
>>> Versatile Express platform emulating a Cortex A15 CPU and the attached
>>> files.
>>>
>>> Any help or suggestions would be greatly appreciated. Thanks!
>>
>> Do you have a more complete trace, including stack traces ?
>
> Attatched is what I get out of SysRq-t, which is the only thing I have
> (note that the kernel is built with CONFIG_RCU_CPU_STALL_INFO=y):
QEMU for Versatile Express w/ 2 CPUs yields something slightly
different than the real HW platform this is happening with, but it
does produce the RCU stall anyway:
[ 125.762946] BUG: soft lockup - CPU#1 stuck for 53s! [malloc_crazy:91]
[ 125.766841] Modules linked in:
[ 125.768389]
[ 125.769199] CPU: 1 PID: 91 Comm: malloc_crazy Not tainted 3.14.0-rc4 #39
[ 125.769883] task: edbded00 ti: c089c000 task.ti: c089c000
[ 125.771743] PC is at load_balance+0x4b0/0x760
[ 125.772069] LR is at cpumask_next_and+0x44/0x5c
[ 125.772387] pc : [<c004ff58>] lr : [<c01db940>] psr: 60000113
[ 125.772387] sp : c089db48 ip : 80000113 fp : edfd8cf4
[ 125.773128] r10: c0de871c r9 : ed893840 r8 : 00000000
[ 125.773452] r7 : c0de8458 r6 : edfd8840 r5 : edfd8840 r4 : ed89389c
[ 125.773825] r3 : 000012d8 r2 : 80000113 r1 : 00000023 r0 : 00000000
[ 125.774332] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 125.774753] Control: 30c7387d Table: 80835d40 DAC: 00000000
[ 125.775266] CPU: 1 PID: 91 Comm: malloc_test_bcm Not tainted 3.14.0-rc4 #39
[ 125.776392] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
(show_stack+0x10/0x14)
[ 125.777026] [<c00111a4>] (show_stack) from [<c04c1bd4>]
(dump_stack+0x84/0x94)
[ 125.777429] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
(watchdog_timer_fn+0x144/0x17c)
[ 125.777865] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
(__run_hrtimer.isra.32+0x54/0xe4)
[ 125.778333] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
(hrtimer_interrupt+0x11c/0x2d0)
[ 125.778814] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
(arch_timer_handler_virt+0x28/0x30)
[ 125.779297] [<c03c4f80>] (arch_timer_handler_virt) from
[<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
[ 125.779734] [<c006280c>] (handle_percpu_devid_irq) from
[<c005edec>] (generic_handle_irq+0x2c/0x3c)
[ 125.780145] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
(handle_IRQ+0x40/0x90)
[ 125.780513] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[ 125.780867] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[ 125.781312] Exception stack(0xc089db00 to 0xc089db48)
[ 125.781787] db00: 00000000 00000023 80000113 000012d8 ed89389c
edfd8840 edfd8840 c0de8458
[ 125.782234] db20: 00000000 ed893840 c0de871c edfd8cf4 80000113
c089db48 c01db940 c004ff58
[ 125.782594] db40: 60000113 ffffffff
[ 125.782864] [<c0011d00>] (__irq_svc) from [<c004ff58>]
(load_balance+0x4b0/0x760)
[ 125.783215] [<c004ff58>] (load_balance) from [<c005035c>]
(rebalance_domains+0x154/0x284)
[ 125.783595] [<c005035c>] (rebalance_domains) from [<c00504c0>]
(run_rebalance_domains+0x34/0x164)
[ 125.784000] [<c00504c0>] (run_rebalance_domains) from [<c0025aac>]
(__do_softirq+0x110/0x24c)
[ 125.784388] [<c0025aac>] (__do_softirq) from [<c0025e6c>]
(irq_exit+0xac/0xf4)
[ 125.784726] [<c0025e6c>] (irq_exit) from [<c000eb80>] (handle_IRQ+0x44/0x90)
[ 125.785059] [<c000eb80>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[ 125.785412] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[ 125.785742] Exception stack(0xc089dcf8 to 0xc089dd40)
[ 125.785983] dce0:
ee4e38c0 00000000
[ 125.786360] dd00: 000200da 00000001 ee4e38a0 c0de2340 2d201000
edfe3358 c05c0c18 00000001
[ 125.786737] dd20: c05c0c2c c0e1e180 00000000 c089dd40 c0086050
c0086140 40000113 ffffffff
[ 125.787120] [<c0011d00>] (__irq_svc) from [<c0086140>]
(get_page_from_freelist+0x2bc/0x638)
[ 125.787507] [<c0086140>] (get_page_from_freelist) from [<c0087018>]
(__alloc_pages_nodemask+0x114/0x8f4)
[ 125.787949] [<c0087018>] (__alloc_pages_nodemask) from [<c00a20c8>]
(handle_mm_fault+0x9f8/0xcdc)
[ 125.788357] [<c00a20c8>] (handle_mm_fault) from [<c001793c>]
(do_page_fault+0x194/0x27c)
[ 125.788726] [<c001793c>] (do_page_fault) from [<c000844c>]
(do_DataAbort+0x30/0x90)
[ 125.789080] [<c000844c>] (do_DataAbort) from [<c0011e34>]
(__dabt_usr+0x34/0x40)
[ 125.789403] Exception stack(0xc089dfb0 to 0xc089dff8)
[ 125.789652] dfa0: ae55e008
11111111 000dc000 ae582000
[ 125.790029] dfc0: be967d18 00000000 00008390 00000000 00000000
00000000 b6f29000 be967d04
[ 125.790404] dfe0: 11111111 be967ce8 00008550 b6e5ce48 20000010 ffffffff
[ 125.791282] BUG: soft lockup - CPU#0 stuck for 53s! [kworker/0:1:41]
[ 125.791710] Modules linked in:
[ 125.792046]
[ 125.792836] CPU: 0 PID: 41 Comm: kworker/0:1 Not tainted 3.14.0-rc4 #39
[ 125.793832] task: ed893840 ti: c0826000 task.ti: c0826000
[ 125.795257] PC is at finish_task_switch+0x40/0xec
[ 125.795562] LR is at __schedule+0x1fc/0x51c
[ 125.795856] pc : [<c0044724>] lr : [<c04c3274>] psr: 600f0013
[ 125.795856] sp : c0827ec8 ip : 00000000 fp : c0827edc
[ 125.796416] r10: eda56e00 r9 : c0deac28 r8 : 00000000
[ 125.796698] r7 : c0de871c r6 : c0826020 r5 : ed893840 r4 : 00000001
[ 125.797028] r3 : eda56e00 r2 : 000012d7 r1 : ed854780 r0 : edfd8840
[ 125.797488] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment kernel
[ 125.797866] Control: 30c7387d Table: 808353c0 DAC: fffffffd
[ 125.798321] CPU: 0 PID: 41 Comm: kworker/0:1 Not tainted 3.14.0-rc4 #39
[ 125.800604] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
(show_stack+0x10/0x14)
[ 125.801205] [<c00111a4>] (show_stack) from [<c04c1bd4>]
(dump_stack+0x84/0x94)
[ 125.801786] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
(watchdog_timer_fn+0x144/0x17c)
[ 125.802238] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
(__run_hrtimer.isra.32+0x54/0xe4)
[ 125.802679] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
(hrtimer_interrupt+0x11c/0x2d0)
[ 125.803108] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
(arch_timer_handler_virt+0x28/0x30)
[ 125.803530] [<c03c4f80>] (arch_timer_handler_virt) from
[<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
[ 125.803965] [<c006280c>] (handle_percpu_devid_irq) from
[<c005edec>] (generic_handle_irq+0x2c/0x3c)
[ 125.804380] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
(handle_IRQ+0x40/0x90)
[ 125.804774] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[ 125.805181] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[ 125.805706] Exception stack(0xc0827e80 to 0xc0827ec8)
[ 125.806185] 7e80: edfd8840 ed854780 000012d7 eda56e00 00000001
ed893840 c0826020 c0de871c
[ 125.806623] 7ea0: 00000000 c0deac28 eda56e00 c0827edc 00000000
c0827ec8 c04c3274 c0044724
[ 125.807032] 7ec0: 600f0013 ffffffff
[ 125.807328] [<c0011d00>] (__irq_svc) from [<c0044724>]
(finish_task_switch+0x40/0xec)
[ 125.807752] [<c0044724>] (finish_task_switch) from [<c04c3274>]
(__schedule+0x1fc/0x51c)
[ 125.808175] [<c04c3274>] (__schedule) from [<c0037590>]
(worker_thread+0x210/0x404)
[ 125.808570] [<c0037590>] (worker_thread) from [<c003cba0>]
(kthread+0xd4/0xec)
[ 125.808952] [<c003cba0>] (kthread) from [<c000e338>]
(ret_from_fork+0x14/0x3c)
[ 246.080014] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 246.080611] (detected by 0, t=6972 jiffies, g=4294967160,
c=4294967159, q=127)
[ 246.081576] INFO: Stall ended before state dump start
[ 246.082510] BUG: soft lockup - CPU#1 stuck for 69s! [grep:93]
[ 246.082849] Modules linked in:
[ 246.083046]
[ 246.083179] CPU: 1 PID: 93 Comm: grep Not tainted 3.14.0-rc4 #39
[ 246.083548] task: edbdf480 ti: c09e6000 task.ti: c09e6000
[ 246.083897] PC is at rebalance_domains+0x0/0x284
[ 246.084154] LR is at run_rebalance_domains+0x34/0x164
[ 246.084430] pc : [<c0050208>] lr : [<c00504c0>] psr: 20000113
[ 246.084430] sp : c09e7ef8 ip : c0dde340 fp : 40000005
[ 246.084992] r10: 00000007 r9 : c0ddf840 r8 : c0ddf840
[ 246.085267] r7 : edfe0840 r6 : 00000100 r5 : c0de209c r4 : 00000001
[ 246.085606] r3 : c005048c r2 : 2d201000 r1 : 00000001 r0 : edfe0840
[ 246.085944] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 246.086315] Control: 30c7387d Table: 808351c0 DAC: 00000000
[ 246.086626] CPU: 1 PID: 93 Comm: grep Not tainted 3.14.0-rc4 #39
[ 246.086992] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
(show_stack+0x10/0x14)
[ 246.087420] [<c00111a4>] (show_stack) from [<c04c1bd4>]
(dump_stack+0x84/0x94)
[ 246.087825] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
(watchdog_timer_fn+0x144/0x17c)
[ 246.088260] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
(__run_hrtimer.isra.32+0x54/0xe4)
[ 246.088726] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
(hrtimer_interrupt+0x11c/0x2d0)
[ 246.089190] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
(arch_timer_handler_virt+0x28/0x30)
[ 246.089670] [<c03c4f80>] (arch_timer_handler_virt) from
[<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
[ 246.090155] [<c006280c>] (handle_percpu_devid_irq) from
[<c005edec>] (generic_handle_irq+0x2c/0x3c)
[ 246.090822] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
(handle_IRQ+0x40/0x90)
[ 246.091233] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[ 246.091631] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[ 246.092007] Exception stack(0xc09e7eb0 to 0xc09e7ef8)
[ 246.092291] 7ea0: edfe0840
00000001 2d201000 c005048c
[ 246.092719] 7ec0: 00000001 c0de209c 00000100 edfe0840 c0ddf840
c0ddf840 00000007 40000005
[ 246.093142] 7ee0: c0dde340 c09e7ef8 c00504c0 c0050208 20000113 ffffffff
[ 246.093647] [<c0011d00>] (__irq_svc) from [<c0050208>]
(rebalance_domains+0x0/0x284)
[ 246.094422] [<c0050208>] (rebalance_domains) from [<c09e6028>] (0xc09e6028)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists