[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140305014310.GC3334@linux.vnet.ibm.com>
Date: Tue, 4 Mar 2014 17:43:10 -0800
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Florian Fainelli <f.fainelli@...il.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>,
linux-nfs <linux-nfs@...r.kernel.org>,
"trond.myklebust" <trond.myklebust@...marydata.com>,
netdev <netdev@...r.kernel.org>
Subject: Re: RCU stalls when running out of memory on 3.14-rc4 w/ NFS and
kernel threads priorities changed
On Tue, Mar 04, 2014 at 05:16:27PM -0800, Florian Fainelli wrote:
> 2014-03-04 17:03 GMT-08:00 Florian Fainelli <f.fainelli@...il.com>:
> > 2014-03-04 16:48 GMT-08:00 Eric Dumazet <eric.dumazet@...il.com>:
> >> On Tue, 2014-03-04 at 15:55 -0800, Florian Fainelli wrote:
> >>> Hi all,
> >>>
> >>> I am seeing the following RCU stalls messages appearing on an ARMv7
> >>> 4xCPUs system running 3.14-rc4:
> >>>
> >>> [ 42.974327] INFO: rcu_sched detected stalls on CPUs/tasks:
> >>> [ 42.979839] (detected by 0, t=2102 jiffies, g=4294967082,
> >>> c=4294967081, q=516)
> >>> [ 42.987169] INFO: Stall ended before state dump start
> >>>
> >>> this is happening under the following conditions:
> >>>
> >>> - the attached bumper.c binary alters various kernel thread priorities
> >>> based on the contents of bumpup.cfg and
> >>> - malloc_crazy is running from a NFS share
> >>> - malloc_crazy.c is running in a loop allocating chunks of memory but
> >>> never freeing it
> >>>
> >>> when the priorities are altered, instead of getting the OOM killer to
> >>> be invoked, the RCU stalls are happening. Taking NFS out of the
> >>> equation does not allow me to reproduce the problem even with the
> >>> priorities altered.
> >>>
> >>> This "problem" seems to have been there for quite a while now since I
> >>> was able to get 3.8.13 to trigger that bug as well, with a slightly
> >>> more detailed RCU debugging trace which points the finger at kswapd0.
> >>>
> >>> You should be able to get that reproduced under QEMU with the
> >>> Versatile Express platform emulating a Cortex A15 CPU and the attached
> >>> files.
> >>>
> >>> Any help or suggestions would be greatly appreciated. Thanks!
> >>
> >> Do you have a more complete trace, including stack traces ?
> >
> > Attatched is what I get out of SysRq-t, which is the only thing I have
> > (note that the kernel is built with CONFIG_RCU_CPU_STALL_INFO=y):
>
> QEMU for Versatile Express w/ 2 CPUs yields something slightly
> different than the real HW platform this is happening with, but it
> does produce the RCU stall anyway:
>
> [ 125.762946] BUG: soft lockup - CPU#1 stuck for 53s! [malloc_crazy:91]
This soft-lockup condition can result in RCU CPU stall warnings. Fix
the problem causing the soft lockup, and I bet that your RCU CPU stall
warnings go away.
Thanx, Paul
> [ 125.766841] Modules linked in:
> [ 125.768389]
> [ 125.769199] CPU: 1 PID: 91 Comm: malloc_crazy Not tainted 3.14.0-rc4 #39
> [ 125.769883] task: edbded00 ti: c089c000 task.ti: c089c000
> [ 125.771743] PC is at load_balance+0x4b0/0x760
> [ 125.772069] LR is at cpumask_next_and+0x44/0x5c
> [ 125.772387] pc : [<c004ff58>] lr : [<c01db940>] psr: 60000113
> [ 125.772387] sp : c089db48 ip : 80000113 fp : edfd8cf4
> [ 125.773128] r10: c0de871c r9 : ed893840 r8 : 00000000
> [ 125.773452] r7 : c0de8458 r6 : edfd8840 r5 : edfd8840 r4 : ed89389c
> [ 125.773825] r3 : 000012d8 r2 : 80000113 r1 : 00000023 r0 : 00000000
> [ 125.774332] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> [ 125.774753] Control: 30c7387d Table: 80835d40 DAC: 00000000
> [ 125.775266] CPU: 1 PID: 91 Comm: malloc_test_bcm Not tainted 3.14.0-rc4 #39
> [ 125.776392] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
> (show_stack+0x10/0x14)
> [ 125.777026] [<c00111a4>] (show_stack) from [<c04c1bd4>]
> (dump_stack+0x84/0x94)
> [ 125.777429] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
> (watchdog_timer_fn+0x144/0x17c)
> [ 125.777865] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
> (__run_hrtimer.isra.32+0x54/0xe4)
> [ 125.778333] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
> (hrtimer_interrupt+0x11c/0x2d0)
> [ 125.778814] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
> (arch_timer_handler_virt+0x28/0x30)
> [ 125.779297] [<c03c4f80>] (arch_timer_handler_virt) from
> [<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
> [ 125.779734] [<c006280c>] (handle_percpu_devid_irq) from
> [<c005edec>] (generic_handle_irq+0x2c/0x3c)
> [ 125.780145] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
> (handle_IRQ+0x40/0x90)
> [ 125.780513] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
> (gic_handle_irq+0x2c/0x5c)
> [ 125.780867] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
> (__irq_svc+0x40/0x50)
> [ 125.781312] Exception stack(0xc089db00 to 0xc089db48)
> [ 125.781787] db00: 00000000 00000023 80000113 000012d8 ed89389c
> edfd8840 edfd8840 c0de8458
> [ 125.782234] db20: 00000000 ed893840 c0de871c edfd8cf4 80000113
> c089db48 c01db940 c004ff58
> [ 125.782594] db40: 60000113 ffffffff
> [ 125.782864] [<c0011d00>] (__irq_svc) from [<c004ff58>]
> (load_balance+0x4b0/0x760)
> [ 125.783215] [<c004ff58>] (load_balance) from [<c005035c>]
> (rebalance_domains+0x154/0x284)
> [ 125.783595] [<c005035c>] (rebalance_domains) from [<c00504c0>]
> (run_rebalance_domains+0x34/0x164)
> [ 125.784000] [<c00504c0>] (run_rebalance_domains) from [<c0025aac>]
> (__do_softirq+0x110/0x24c)
> [ 125.784388] [<c0025aac>] (__do_softirq) from [<c0025e6c>]
> (irq_exit+0xac/0xf4)
> [ 125.784726] [<c0025e6c>] (irq_exit) from [<c000eb80>] (handle_IRQ+0x44/0x90)
> [ 125.785059] [<c000eb80>] (handle_IRQ) from [<c0008568>]
> (gic_handle_irq+0x2c/0x5c)
> [ 125.785412] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
> (__irq_svc+0x40/0x50)
> [ 125.785742] Exception stack(0xc089dcf8 to 0xc089dd40)
> [ 125.785983] dce0:
> ee4e38c0 00000000
> [ 125.786360] dd00: 000200da 00000001 ee4e38a0 c0de2340 2d201000
> edfe3358 c05c0c18 00000001
> [ 125.786737] dd20: c05c0c2c c0e1e180 00000000 c089dd40 c0086050
> c0086140 40000113 ffffffff
> [ 125.787120] [<c0011d00>] (__irq_svc) from [<c0086140>]
> (get_page_from_freelist+0x2bc/0x638)
> [ 125.787507] [<c0086140>] (get_page_from_freelist) from [<c0087018>]
> (__alloc_pages_nodemask+0x114/0x8f4)
> [ 125.787949] [<c0087018>] (__alloc_pages_nodemask) from [<c00a20c8>]
> (handle_mm_fault+0x9f8/0xcdc)
> [ 125.788357] [<c00a20c8>] (handle_mm_fault) from [<c001793c>]
> (do_page_fault+0x194/0x27c)
> [ 125.788726] [<c001793c>] (do_page_fault) from [<c000844c>]
> (do_DataAbort+0x30/0x90)
> [ 125.789080] [<c000844c>] (do_DataAbort) from [<c0011e34>]
> (__dabt_usr+0x34/0x40)
> [ 125.789403] Exception stack(0xc089dfb0 to 0xc089dff8)
> [ 125.789652] dfa0: ae55e008
> 11111111 000dc000 ae582000
> [ 125.790029] dfc0: be967d18 00000000 00008390 00000000 00000000
> 00000000 b6f29000 be967d04
> [ 125.790404] dfe0: 11111111 be967ce8 00008550 b6e5ce48 20000010 ffffffff
> [ 125.791282] BUG: soft lockup - CPU#0 stuck for 53s! [kworker/0:1:41]
> [ 125.791710] Modules linked in:
> [ 125.792046]
> [ 125.792836] CPU: 0 PID: 41 Comm: kworker/0:1 Not tainted 3.14.0-rc4 #39
> [ 125.793832] task: ed893840 ti: c0826000 task.ti: c0826000
> [ 125.795257] PC is at finish_task_switch+0x40/0xec
> [ 125.795562] LR is at __schedule+0x1fc/0x51c
> [ 125.795856] pc : [<c0044724>] lr : [<c04c3274>] psr: 600f0013
> [ 125.795856] sp : c0827ec8 ip : 00000000 fp : c0827edc
> [ 125.796416] r10: eda56e00 r9 : c0deac28 r8 : 00000000
> [ 125.796698] r7 : c0de871c r6 : c0826020 r5 : ed893840 r4 : 00000001
> [ 125.797028] r3 : eda56e00 r2 : 000012d7 r1 : ed854780 r0 : edfd8840
> [ 125.797488] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
> Segment kernel
> [ 125.797866] Control: 30c7387d Table: 808353c0 DAC: fffffffd
> [ 125.798321] CPU: 0 PID: 41 Comm: kworker/0:1 Not tainted 3.14.0-rc4 #39
> [ 125.800604] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
> (show_stack+0x10/0x14)
> [ 125.801205] [<c00111a4>] (show_stack) from [<c04c1bd4>]
> (dump_stack+0x84/0x94)
> [ 125.801786] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
> (watchdog_timer_fn+0x144/0x17c)
> [ 125.802238] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
> (__run_hrtimer.isra.32+0x54/0xe4)
> [ 125.802679] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
> (hrtimer_interrupt+0x11c/0x2d0)
> [ 125.803108] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
> (arch_timer_handler_virt+0x28/0x30)
> [ 125.803530] [<c03c4f80>] (arch_timer_handler_virt) from
> [<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
> [ 125.803965] [<c006280c>] (handle_percpu_devid_irq) from
> [<c005edec>] (generic_handle_irq+0x2c/0x3c)
> [ 125.804380] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
> (handle_IRQ+0x40/0x90)
> [ 125.804774] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
> (gic_handle_irq+0x2c/0x5c)
> [ 125.805181] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
> (__irq_svc+0x40/0x50)
> [ 125.805706] Exception stack(0xc0827e80 to 0xc0827ec8)
> [ 125.806185] 7e80: edfd8840 ed854780 000012d7 eda56e00 00000001
> ed893840 c0826020 c0de871c
> [ 125.806623] 7ea0: 00000000 c0deac28 eda56e00 c0827edc 00000000
> c0827ec8 c04c3274 c0044724
> [ 125.807032] 7ec0: 600f0013 ffffffff
> [ 125.807328] [<c0011d00>] (__irq_svc) from [<c0044724>]
> (finish_task_switch+0x40/0xec)
> [ 125.807752] [<c0044724>] (finish_task_switch) from [<c04c3274>]
> (__schedule+0x1fc/0x51c)
> [ 125.808175] [<c04c3274>] (__schedule) from [<c0037590>]
> (worker_thread+0x210/0x404)
> [ 125.808570] [<c0037590>] (worker_thread) from [<c003cba0>]
> (kthread+0xd4/0xec)
> [ 125.808952] [<c003cba0>] (kthread) from [<c000e338>]
> (ret_from_fork+0x14/0x3c)
> [ 246.080014] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 246.080611] (detected by 0, t=6972 jiffies, g=4294967160,
> c=4294967159, q=127)
> [ 246.081576] INFO: Stall ended before state dump start
> [ 246.082510] BUG: soft lockup - CPU#1 stuck for 69s! [grep:93]
> [ 246.082849] Modules linked in:
> [ 246.083046]
> [ 246.083179] CPU: 1 PID: 93 Comm: grep Not tainted 3.14.0-rc4 #39
> [ 246.083548] task: edbdf480 ti: c09e6000 task.ti: c09e6000
> [ 246.083897] PC is at rebalance_domains+0x0/0x284
> [ 246.084154] LR is at run_rebalance_domains+0x34/0x164
> [ 246.084430] pc : [<c0050208>] lr : [<c00504c0>] psr: 20000113
> [ 246.084430] sp : c09e7ef8 ip : c0dde340 fp : 40000005
> [ 246.084992] r10: 00000007 r9 : c0ddf840 r8 : c0ddf840
> [ 246.085267] r7 : edfe0840 r6 : 00000100 r5 : c0de209c r4 : 00000001
> [ 246.085606] r3 : c005048c r2 : 2d201000 r1 : 00000001 r0 : edfe0840
> [ 246.085944] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> [ 246.086315] Control: 30c7387d Table: 808351c0 DAC: 00000000
> [ 246.086626] CPU: 1 PID: 93 Comm: grep Not tainted 3.14.0-rc4 #39
> [ 246.086992] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
> (show_stack+0x10/0x14)
> [ 246.087420] [<c00111a4>] (show_stack) from [<c04c1bd4>]
> (dump_stack+0x84/0x94)
> [ 246.087825] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
> (watchdog_timer_fn+0x144/0x17c)
> [ 246.088260] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
> (__run_hrtimer.isra.32+0x54/0xe4)
> [ 246.088726] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
> (hrtimer_interrupt+0x11c/0x2d0)
> [ 246.089190] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
> (arch_timer_handler_virt+0x28/0x30)
> [ 246.089670] [<c03c4f80>] (arch_timer_handler_virt) from
> [<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
> [ 246.090155] [<c006280c>] (handle_percpu_devid_irq) from
> [<c005edec>] (generic_handle_irq+0x2c/0x3c)
> [ 246.090822] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
> (handle_IRQ+0x40/0x90)
> [ 246.091233] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
> (gic_handle_irq+0x2c/0x5c)
> [ 246.091631] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
> (__irq_svc+0x40/0x50)
> [ 246.092007] Exception stack(0xc09e7eb0 to 0xc09e7ef8)
> [ 246.092291] 7ea0: edfe0840
> 00000001 2d201000 c005048c
> [ 246.092719] 7ec0: 00000001 c0de209c 00000100 edfe0840 c0ddf840
> c0ddf840 00000007 40000005
> [ 246.093142] 7ee0: c0dde340 c09e7ef8 c00504c0 c0050208 20000113 ffffffff
> [ 246.093647] [<c0011d00>] (__irq_svc) from [<c0050208>]
> (rebalance_domains+0x0/0x284)
> [ 246.094422] [<c0050208>] (rebalance_domains) from [<c09e6028>] (0xc09e6028)
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists