linux-kernel - Re: RCU stalls when running out of memory on 3.14-rc4 w/ NFS and kernel threads priorities changed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGVrzcbRycBy0w64R9pV=JG6M3aJeARbOnh-xRrumYzzVDgWGQ@mail.gmail.com>
Date:	Tue, 4 Mar 2014 17:16:27 -0800
From:	Florian Fainelli <f.fainelli@...il.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>,
	paulmck <paulmck@...ux.vnet.ibm.com>,
	linux-nfs <linux-nfs@...r.kernel.org>,
	"trond.myklebust" <trond.myklebust@...marydata.com>,
	netdev <netdev@...r.kernel.org>
Subject: Re: RCU stalls when running out of memory on 3.14-rc4 w/ NFS and
 kernel threads priorities changed

2014-03-04 17:03 GMT-08:00 Florian Fainelli <f.fainelli@...il.com>:
> 2014-03-04 16:48 GMT-08:00 Eric Dumazet <eric.dumazet@...il.com>:
>> On Tue, 2014-03-04 at 15:55 -0800, Florian Fainelli wrote:
>>> Hi all,
>>>
>>> I am seeing the following RCU stalls messages appearing on an ARMv7
>>> 4xCPUs system running 3.14-rc4:
>>>
>>> [   42.974327] INFO: rcu_sched detected stalls on CPUs/tasks:
>>> [   42.979839]  (detected by 0, t=2102 jiffies, g=4294967082,
>>> c=4294967081, q=516)
>>> [   42.987169] INFO: Stall ended before state dump start
>>>
>>> this is happening under the following conditions:
>>>
>>> - the attached bumper.c binary alters various kernel thread priorities
>>> based on the contents of bumpup.cfg and
>>> - malloc_crazy is running from a NFS share
>>> - malloc_crazy.c is running in a loop allocating chunks of memory but
>>> never freeing it
>>>
>>> when the priorities are altered, instead of getting the OOM killer to
>>> be invoked, the RCU stalls are happening. Taking NFS out of the
>>> equation does not allow me to reproduce the problem even with the
>>> priorities altered.
>>>
>>> This "problem" seems to have been there for quite a while now since I
>>> was able to get 3.8.13 to trigger that bug as well, with a slightly
>>> more detailed RCU debugging trace which points the finger at kswapd0.
>>>
>>> You should be able to get that reproduced under QEMU with the
>>> Versatile Express platform emulating a Cortex A15 CPU and the attached
>>> files.
>>>
>>> Any help or suggestions would be greatly appreciated. Thanks!
>>
>> Do you have a more complete trace, including stack traces ?
>
> Attatched is what I get out of SysRq-t, which is the only thing I have
> (note that the kernel is built with CONFIG_RCU_CPU_STALL_INFO=y):

QEMU for Versatile Express w/ 2 CPUs yields something slightly
different than the real HW platform this is happening with, but it
does produce the RCU stall anyway:

[  125.762946] BUG: soft lockup - CPU#1 stuck for 53s! [malloc_crazy:91]
[  125.766841] Modules linked in:
[  125.768389]
[  125.769199] CPU: 1 PID: 91 Comm: malloc_crazy Not tainted 3.14.0-rc4 #39
[  125.769883] task: edbded00 ti: c089c000 task.ti: c089c000
[  125.771743] PC is at load_balance+0x4b0/0x760
[  125.772069] LR is at cpumask_next_and+0x44/0x5c
[  125.772387] pc : [<c004ff58>]    lr : [<c01db940>]    psr: 60000113
[  125.772387] sp : c089db48  ip : 80000113  fp : edfd8cf4
[  125.773128] r10: c0de871c  r9 : ed893840  r8 : 00000000
[  125.773452] r7 : c0de8458  r6 : edfd8840  r5 : edfd8840  r4 : ed89389c
[  125.773825] r3 : 000012d8  r2 : 80000113  r1 : 00000023  r0 : 00000000
[  125.774332] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  125.774753] Control: 30c7387d  Table: 80835d40  DAC: 00000000
[  125.775266] CPU: 1 PID: 91 Comm: malloc_test_bcm Not tainted 3.14.0-rc4 #39
[  125.776392] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
(show_stack+0x10/0x14)
[  125.777026] [<c00111a4>] (show_stack) from [<c04c1bd4>]
(dump_stack+0x84/0x94)
[  125.777429] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
(watchdog_timer_fn+0x144/0x17c)
[  125.777865] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
(__run_hrtimer.isra.32+0x54/0xe4)
[  125.778333] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
(hrtimer_interrupt+0x11c/0x2d0)
[  125.778814] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
(arch_timer_handler_virt+0x28/0x30)
[  125.779297] [<c03c4f80>] (arch_timer_handler_virt) from
[<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
[  125.779734] [<c006280c>] (handle_percpu_devid_irq) from
[<c005edec>] (generic_handle_irq+0x2c/0x3c)
[  125.780145] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
(handle_IRQ+0x40/0x90)
[  125.780513] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[  125.780867] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[  125.781312] Exception stack(0xc089db00 to 0xc089db48)
[  125.781787] db00: 00000000 00000023 80000113 000012d8 ed89389c
edfd8840 edfd8840 c0de8458
[  125.782234] db20: 00000000 ed893840 c0de871c edfd8cf4 80000113
c089db48 c01db940 c004ff58
[  125.782594] db40: 60000113 ffffffff
[  125.782864] [<c0011d00>] (__irq_svc) from [<c004ff58>]
(load_balance+0x4b0/0x760)
[  125.783215] [<c004ff58>] (load_balance) from [<c005035c>]
(rebalance_domains+0x154/0x284)
[  125.783595] [<c005035c>] (rebalance_domains) from [<c00504c0>]
(run_rebalance_domains+0x34/0x164)
[  125.784000] [<c00504c0>] (run_rebalance_domains) from [<c0025aac>]
(__do_softirq+0x110/0x24c)
[  125.784388] [<c0025aac>] (__do_softirq) from [<c0025e6c>]
(irq_exit+0xac/0xf4)
[  125.784726] [<c0025e6c>] (irq_exit) from [<c000eb80>] (handle_IRQ+0x44/0x90)
[  125.785059] [<c000eb80>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[  125.785412] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[  125.785742] Exception stack(0xc089dcf8 to 0xc089dd40)
[  125.785983] dce0:
    ee4e38c0 00000000
[  125.786360] dd00: 000200da 00000001 ee4e38a0 c0de2340 2d201000
edfe3358 c05c0c18 00000001
[  125.786737] dd20: c05c0c2c c0e1e180 00000000 c089dd40 c0086050
c0086140 40000113 ffffffff
[  125.787120] [<c0011d00>] (__irq_svc) from [<c0086140>]
(get_page_from_freelist+0x2bc/0x638)
[  125.787507] [<c0086140>] (get_page_from_freelist) from [<c0087018>]
(__alloc_pages_nodemask+0x114/0x8f4)
[  125.787949] [<c0087018>] (__alloc_pages_nodemask) from [<c00a20c8>]
(handle_mm_fault+0x9f8/0xcdc)
[  125.788357] [<c00a20c8>] (handle_mm_fault) from [<c001793c>]
(do_page_fault+0x194/0x27c)
[  125.788726] [<c001793c>] (do_page_fault) from [<c000844c>]
(do_DataAbort+0x30/0x90)
[  125.789080] [<c000844c>] (do_DataAbort) from [<c0011e34>]
(__dabt_usr+0x34/0x40)
[  125.789403] Exception stack(0xc089dfb0 to 0xc089dff8)
[  125.789652] dfa0:                                     ae55e008
11111111 000dc000 ae582000
[  125.790029] dfc0: be967d18 00000000 00008390 00000000 00000000
00000000 b6f29000 be967d04
[  125.790404] dfe0: 11111111 be967ce8 00008550 b6e5ce48 20000010 ffffffff
[  125.791282] BUG: soft lockup - CPU#0 stuck for 53s! [kworker/0:1:41]
[  125.791710] Modules linked in:
[  125.792046]
[  125.792836] CPU: 0 PID: 41 Comm: kworker/0:1 Not tainted 3.14.0-rc4 #39
[  125.793832] task: ed893840 ti: c0826000 task.ti: c0826000
[  125.795257] PC is at finish_task_switch+0x40/0xec
[  125.795562] LR is at __schedule+0x1fc/0x51c
[  125.795856] pc : [<c0044724>]    lr : [<c04c3274>]    psr: 600f0013
[  125.795856] sp : c0827ec8  ip : 00000000  fp : c0827edc
[  125.796416] r10: eda56e00  r9 : c0deac28  r8 : 00000000
[  125.796698] r7 : c0de871c  r6 : c0826020  r5 : ed893840  r4 : 00000001
[  125.797028] r3 : eda56e00  r2 : 000012d7  r1 : ed854780  r0 : edfd8840
[  125.797488] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
Segment kernel
[  125.797866] Control: 30c7387d  Table: 808353c0  DAC: fffffffd
[  125.798321] CPU: 0 PID: 41 Comm: kworker/0:1 Not tainted 3.14.0-rc4 #39
[  125.800604] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
(show_stack+0x10/0x14)
[  125.801205] [<c00111a4>] (show_stack) from [<c04c1bd4>]
(dump_stack+0x84/0x94)
[  125.801786] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
(watchdog_timer_fn+0x144/0x17c)
[  125.802238] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
(__run_hrtimer.isra.32+0x54/0xe4)
[  125.802679] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
(hrtimer_interrupt+0x11c/0x2d0)
[  125.803108] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
(arch_timer_handler_virt+0x28/0x30)
[  125.803530] [<c03c4f80>] (arch_timer_handler_virt) from
[<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
[  125.803965] [<c006280c>] (handle_percpu_devid_irq) from
[<c005edec>] (generic_handle_irq+0x2c/0x3c)
[  125.804380] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
(handle_IRQ+0x40/0x90)
[  125.804774] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[  125.805181] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[  125.805706] Exception stack(0xc0827e80 to 0xc0827ec8)
[  125.806185] 7e80: edfd8840 ed854780 000012d7 eda56e00 00000001
ed893840 c0826020 c0de871c
[  125.806623] 7ea0: 00000000 c0deac28 eda56e00 c0827edc 00000000
c0827ec8 c04c3274 c0044724
[  125.807032] 7ec0: 600f0013 ffffffff
[  125.807328] [<c0011d00>] (__irq_svc) from [<c0044724>]
(finish_task_switch+0x40/0xec)
[  125.807752] [<c0044724>] (finish_task_switch) from [<c04c3274>]
(__schedule+0x1fc/0x51c)
[  125.808175] [<c04c3274>] (__schedule) from [<c0037590>]
(worker_thread+0x210/0x404)
[  125.808570] [<c0037590>] (worker_thread) from [<c003cba0>]
(kthread+0xd4/0xec)
[  125.808952] [<c003cba0>] (kthread) from [<c000e338>]
(ret_from_fork+0x14/0x3c)
[  246.080014] INFO: rcu_sched detected stalls on CPUs/tasks:
[  246.080611]  (detected by 0, t=6972 jiffies, g=4294967160,
c=4294967159, q=127)
[  246.081576] INFO: Stall ended before state dump start
[  246.082510] BUG: soft lockup - CPU#1 stuck for 69s! [grep:93]
[  246.082849] Modules linked in:
[  246.083046]
[  246.083179] CPU: 1 PID: 93 Comm: grep Not tainted 3.14.0-rc4 #39
[  246.083548] task: edbdf480 ti: c09e6000 task.ti: c09e6000
[  246.083897] PC is at rebalance_domains+0x0/0x284
[  246.084154] LR is at run_rebalance_domains+0x34/0x164
[  246.084430] pc : [<c0050208>]    lr : [<c00504c0>]    psr: 20000113
[  246.084430] sp : c09e7ef8  ip : c0dde340  fp : 40000005
[  246.084992] r10: 00000007  r9 : c0ddf840  r8 : c0ddf840
[  246.085267] r7 : edfe0840  r6 : 00000100  r5 : c0de209c  r4 : 00000001
[  246.085606] r3 : c005048c  r2 : 2d201000  r1 : 00000001  r0 : edfe0840
[  246.085944] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  246.086315] Control: 30c7387d  Table: 808351c0  DAC: 00000000
[  246.086626] CPU: 1 PID: 93 Comm: grep Not tainted 3.14.0-rc4 #39
[  246.086992] [<c0015624>] (unwind_backtrace) from [<c00111a4>]
(show_stack+0x10/0x14)
[  246.087420] [<c00111a4>] (show_stack) from [<c04c1bd4>]
(dump_stack+0x84/0x94)
[  246.087825] [<c04c1bd4>] (dump_stack) from [<c007e7f0>]
(watchdog_timer_fn+0x144/0x17c)
[  246.088260] [<c007e7f0>] (watchdog_timer_fn) from [<c003f58c>]
(__run_hrtimer.isra.32+0x54/0xe4)
[  246.088726] [<c003f58c>] (__run_hrtimer.isra.32) from [<c003fea4>]
(hrtimer_interrupt+0x11c/0x2d0)
[  246.089190] [<c003fea4>] (hrtimer_interrupt) from [<c03c4f80>]
(arch_timer_handler_virt+0x28/0x30)
[  246.089670] [<c03c4f80>] (arch_timer_handler_virt) from
[<c006280c>] (handle_percpu_devid_irq+0x6c/0x84)
[  246.090155] [<c006280c>] (handle_percpu_devid_irq) from
[<c005edec>] (generic_handle_irq+0x2c/0x3c)
[  246.090822] [<c005edec>] (generic_handle_irq) from [<c000eb7c>]
(handle_IRQ+0x40/0x90)
[  246.091233] [<c000eb7c>] (handle_IRQ) from [<c0008568>]
(gic_handle_irq+0x2c/0x5c)
[  246.091631] [<c0008568>] (gic_handle_irq) from [<c0011d00>]
(__irq_svc+0x40/0x50)
[  246.092007] Exception stack(0xc09e7eb0 to 0xc09e7ef8)
[  246.092291] 7ea0:                                     edfe0840
00000001 2d201000 c005048c
[  246.092719] 7ec0: 00000001 c0de209c 00000100 edfe0840 c0ddf840
c0ddf840 00000007 40000005
[  246.093142] 7ee0: c0dde340 c09e7ef8 c00504c0 c0050208 20000113 ffffffff
[  246.093647] [<c0011d00>] (__irq_svc) from [<c0050208>]
(rebalance_domains+0x0/0x284)
[  246.094422] [<c0050208>] (rebalance_domains) from [<c09e6028>] (0xc09e6028)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/