Date:	Fri, 30 Sep 2011 00:10:49 -0400
From:	Pavel Ivanov <paivanof@...il.com>
To:	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: Full lockup when compiling kernel with "optimal" number of
 threads

On Fri, 2011-09-02 at 23:34 -0400, Pavel Ivanov wrote:
> Hi,
> 
> I can reliably reproduce a complete machine lockup when compiling
> the kernel sources with "make -j". After making some progress the
> machine stops responding to anything (including CapsLock/NumLock
> switching or mouse movement), and after a hard reboot nothing is
> left in kern.log or syslog. Only attaching a serial console gives
> me the following clues to what happens:
> 
> [  376.460584] INFO: task cc1:6839 blocked for more than 60 seconds.
> [  376.533411] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  376.627129] INFO: task cc1:6840 blocked for more than 60 seconds.
> [  376.699991] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  376.793636] INFO: task cc1:6850 blocked for more than 60 seconds.
> [  376.866397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  376.960026] INFO: task cc1:7017 blocked for more than 60 seconds.
> [  377.032776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.128156] INFO: task cc1:7079 blocked for more than 60 seconds.
> [  377.200907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.294522] INFO: task cc1:7188 blocked for more than 60 seconds.
> [  377.367274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.460984] INFO: task cc1:8342 blocked for more than 60 seconds.
> [  377.533746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.627372] INFO: task cc1:8425 blocked for more than 60 seconds.
> [  377.700119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.793737] INFO: task cc1:8502 blocked for more than 60 seconds.
> [  377.866488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.960103] INFO: task cc1:8535 blocked for more than 60 seconds.
> [  378.034788] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
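
(For anyone who wants to try reproducing this: it is just an
unbounded parallel build, nothing exotic. The tree location below is
of course mine; any recent kernel tree should behave the same.)

    cd ~/src/linux-3.1-rc8    # hypothetical path; any kernel tree works
    make defconfig
    # "make -j" with no job count spawns one job per pending target,
    # so several hundred cc1 processes end up runnable at once
    make -j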


I've investigated this problem a little more. It turned out that the
problem happens on ext4 too (initially I reported it on ecryptfs and
thought it was somehow related to that). And the machine is not
completely stalled: it still manages to do some work (the repeated
reports from the hung task detector differ slightly from each other).
But I still can't understand what is happening there. I was able to
catch some stack traces via the serial console (the kernel is 3.1-rc8
with minor changes). I've attached everything I caught. In short:
a lot of tasks are repeatedly reported hung with the following stack:

[  264.428209]   [<ffffffff8109d5cf>] schedule+0x3f/0x60
[  264.488832]   [<ffffffff81650e85>] schedule_timeout+0x335/0x3c0
[  264.559760]   [<ffffffff8164fa4a>] ? wait_for_common+0x4a/0x180
[  264.630695]   [<ffffffff8164fa4a>] ? wait_for_common+0x4a/0x180
[  264.701621]   [<ffffffff8164fb0f>] ? wait_for_common+0x10f/0x180
[  264.773592]   [<ffffffff8164fb17>] wait_for_common+0x117/0x180
[  264.843481]   [<ffffffff8109c720>] ? try_to_wake_up+0x2f0/0x2f0
[  264.914414]   [<ffffffff8164fc5d>] wait_for_completion+0x1d/0x20
[  264.986383]   [<ffffffff810a4091>] do_fork+0x1b1/0x380
[  265.047957]   [<ffffffff810bc292>] ? set_current_blocked+0x52/0x60
[  265.122002]   [<ffffffff8105e505>] sys_vfork+0x25/0x30
[  265.183573]   [<ffffffff8165af63>] stub_vfork+0x13/0x20
[  265.246188]   [<ffffffff8165ab82>] ? system_call_fastpath+0x16/0x1b
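
My reading of that stack (not verified against the scheduler code, so
take it with a grain of salt): these are parents blocked in vfork(),
which by design sleeps in wait_for_completion() until the child calls
exec or exits. make and gcc use vfork() heavily, so if the children
never get to run, hundreds of parents pile up exactly here. While the
machine is still half-alive I can sample what a stuck task waits on:

    # pids are the ones from the hung task reports above
    cat /proc/6839/stack        # kernel stack; needs CONFIG_STACKTRACE
    cat /proc/6839/wchan; echo  # symbol the task is sleeping in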

Among the others there are khugepaged, compiz and unity processes
apparently waiting on disk reads, and there is even this:

[  514.443268] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 0, t=6002 jiffies)
[  514.443460] Stack:
[  514.443462]  ffff88013bc03d38
[  514.443464]  ffffffff8135a222
[  514.443465]  000000205d343732
[  514.443467]  00000000000003e9
[  514.443469]  0000000000001000
[  514.443470]  ffffffff81cd6e00
[  514.443472]  0000000000000400
[  514.443473]  0000000000000096
[  514.443475]  ffff88013bc03d48
[  514.443477]  ffffffff8135a13e
[  514.443478]  ffff88013bc03d68
[  514.443480]  ffffffff810722d2
[  514.443482] Call Trace:
[  514.443484]  <IRQ>
[  514.443485]   [<ffffffff8135a222>] delay_tsc+0x82/0xf0
[  514.443491]   [<ffffffff8135a13e>] __const_udelay+0x2e/0x30
[  514.443495]   [<ffffffff810722d2>] native_safe_apic_wait_icr_idle+0x22/0x50
[  514.443500]   [<ffffffff81073483>] default_send_IPI_mask_sequence_phys+0x103/0x110
[  514.443506]   [<ffffffff81077607>] physflat_send_IPI_all+0x17/0x20
[  514.443510]   [<ffffffff8107363a>] arch_trigger_all_cpu_backtrace+0x5a/0x90
[  514.443514]   [<ffffffff811238bf>] __rcu_pending+0x37f/0x3e0
[  514.443519]   [<ffffffff81123f02>] rcu_check_callbacks+0x132/0x1b0
[  514.443523]   [<ffffffff810b7698>] update_process_times+0x48/0x90
[  514.443528]   [<ffffffff810dc900>] tick_sched_timer+0x60/0xc0
[  514.443534]   [<ffffffff810ce794>] __run_hrtimer+0x74/0x250
[  514.443537]   [<ffffffff810dc8a0>] ? tick_nohz_handler+0x100/0x100
[  514.443541]   [<ffffffff810cf273>] hrtimer_interrupt+0x103/0x230
[  514.443544]   [<ffffffff8165d026>] smp_apic_timer_interrupt+0x66/0x98
[  514.443549]   [<ffffffff8165b6f3>] apic_timer_interrupt+0x73/0x80
[  514.443554]  <EOI>
[  514.443555]   [<ffffffff813a55df>] ? intel_idle+0xdf/0x140
[  514.443559]   [<ffffffff813a55db>] ? intel_idle+0xdb/0x140
[  514.443563]   [<ffffffff8151a660>] cpuidle_idle_call+0xc0/0x240
[  514.443568]   [<ffffffff81054235>] cpu_idle+0xd5/0x140
[  514.443572]   [<ffffffff8162bb21>] rest_init+0xd5/0xe4
[  514.443575]   [<ffffffff8162ba4c>] ? csum_partial_copy_generic+0x16c/0x16c
[  514.443578]   [<ffffffff81cf4c36>] start_kernel+0x3e1/0x3ec
[  514.443583]   [<ffffffff81cf4347>] x86_64_start_reservations+0x132/0x136
[  514.443587]   [<ffffffff81cf4140>] ? early_idt_handlers+0x140/0x140
[  514.443590]   [<ffffffff81cf444d>] x86_64_start_kernel+0x102/0x111

(and after that all CPUs' stacks are inside cpuidle_idle_call)
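
For what it's worth, that trace is just the RCU stall detector asking
every CPU to dump its stack: arch_trigger_all_cpu_backtrace() sends
the backtrace IPI. If I understand it right, I can request the same
dump by hand from the console next time it happens:

    echo 1 > /proc/sys/kernel/sysrq  # allow all sysrq functions
    echo l > /proc/sysrq-trigger     # backtrace of all active CPUs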


So can somebody suggest how I can debug this problem further and
pinpoint the reason for such a freeze? Or maybe someone has ideas
about what the culprit is?
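
In case it helps with suggestions: the serial console stays attached,
so during the next lockup I can try sending SysRq over it (a serial
BREAK followed by the command key), roughly:

    # enable all sysrq functions while the machine is still healthy
    echo 1 > /proc/sys/kernel/sysrq
    # then over serial during the lockup: BREAK + 'w' (blocked tasks),
    # BREAK + 't' (all tasks), BREAK + 'm' (memory state)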


Thank you,
Pavel


View attachment "traces.txt" of type "text/plain" (278485 bytes)
