Date:	Fri, 30 Sep 2011 00:10:49 -0400
From:	Pavel Ivanov <paivanof@...il.com>
To:	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: Full lockup when compiling kernel with "optimal" number of
 threads

On Fri, 2011-09-02 at 23:34 -0400, Pavel Ivanov wrote:
> Hi,
> 
> I can reliably reproduce a complete machine lockup when compiling
> the kernel sources with "make -j". After making some progress the
> machine stops responding to anything (including CapsLock/NumLock
> switching or mouse movement), and after a hard reboot nothing is
> left in kern.log or syslog. Only attaching a serial console gives
> me the following clues to what happens:
> 
> [  376.460584] INFO: task cc1:6839 blocked for more than 60 seconds.
> [  376.533411] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  376.627129] INFO: task cc1:6840 blocked for more than 60 seconds.
> [  376.699991] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  376.793636] INFO: task cc1:6850 blocked for more than 60 seconds.
> [  376.866397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  376.960026] INFO: task cc1:7017 blocked for more than 60 seconds.
> [  377.032776] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.128156] INFO: task cc1:7079 blocked for more than 60 seconds.
> [  377.200907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.294522] INFO: task cc1:7188 blocked for more than 60 seconds.
> [  377.367274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.460984] INFO: task cc1:8342 blocked for more than 60 seconds.
> [  377.533746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.627372] INFO: task cc1:8425 blocked for more than 60 seconds.
> [  377.700119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.793737] INFO: task cc1:8502 blocked for more than 60 seconds.
> [  377.866488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  377.960103] INFO: task cc1:8535 blocked for more than 60 seconds.
> [  378.034788] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
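
(For anyone who wants to try reproducing this: it is just an
unbounded parallel build, nothing exotic. The tree location below is
of course mine; any recent kernel tree should behave the same.)

    cd ~/src/linux-3.1-rc8    # hypothetical path; any kernel tree works
    make defconfig
    # "make -j" with no job count spawns one job per pending target,
    # so several hundred cc1 processes end up runnable at once
    make -j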


I've investigated this problem a little more. It turned out that the
problem happens on ext4 too (initially I reported it on ecryptfs and
thought it was somehow related to that). And the machine is not
completely stalled: it still manages to do some work (the repeated
reports from the hung task detector differ slightly from each other).
But I still can't understand what is happening there. I was able to
catch some stack traces via the serial console (the kernel is 3.1-rc8
with minor changes). I've attached everything I caught. In short:
a lot of tasks are repeatedly reported hung with the following stack:

[  264.428209]   [<ffffffff8109d5cf>] schedule+0x3f/0x60
[  264.488832]   [<ffffffff81650e85>] schedule_timeout+0x335/0x3c0
[  264.559760]   [<ffffffff8164fa4a>] ? wait_for_common+0x4a/0x180
[  264.630695]   [<ffffffff8164fa4a>] ? wait_for_common+0x4a/0x180
[  264.701621]   [<ffffffff8164fb0f>] ? wait_for_common+0x10f/0x180
[  264.773592]   [<ffffffff8164fb17>] wait_for_common+0x117/0x180
[  264.843481]   [<ffffffff8109c720>] ? try_to_wake_up+0x2f0/0x2f0
[  264.914414]   [<ffffffff8164fc5d>] wait_for_completion+0x1d/0x20
[  264.986383]   [<ffffffff810a4091>] do_fork+0x1b1/0x380
[  265.047957]   [<ffffffff810bc292>] ? set_current_blocked+0x52/0x60
[  265.122002]   [<ffffffff8105e505>] sys_vfork+0x25/0x30
[  265.183573]   [<ffffffff8165af63>] stub_vfork+0x13/0x20
[  265.246188]   [<ffffffff8165ab82>] ? system_call_fastpath+0x16/0x1b
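
My reading of that stack (not verified against the scheduler code, so
take it with a grain of salt): these are parents blocked in vfork(),
which by design sleeps in wait_for_completion() until the child calls
exec or exits. make and gcc use vfork() heavily, so if the children
never get to run, hundreds of parents pile up exactly here. While the
machine is still half-alive I can sample what a stuck task waits on:

    # pids are the ones from the hung task reports above
    cat /proc/6839/stack        # kernel stack; needs CONFIG_STACKTRACE
    cat /proc/6839/wchan; echo  # symbol the task is sleeping in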

Among the others there are khugepaged, compiz and unity processes
apparently waiting on disk reads, and there is even this:

[  514.443268] INFO: rcu_preempt_state detected stalls on CPUs/tasks: {} (detected by 0, t=6002 jiffies)
[  514.443460] Stack:
[  514.443462]  ffff88013bc03d38
[  514.443464]  ffffffff8135a222
[  514.443465]  000000205d343732
[  514.443467]  00000000000003e9
[  514.443469]  0000000000001000
[  514.443470]  ffffffff81cd6e00
[  514.443472]  0000000000000400
[  514.443473]  0000000000000096
[  514.443475]  ffff88013bc03d48
[  514.443477]  ffffffff8135a13e
[  514.443478]  ffff88013bc03d68
[  514.443480]  ffffffff810722d2
[  514.443482] Call Trace:
[  514.443484]  <IRQ>
[  514.443485]   [<ffffffff8135a222>] delay_tsc+0x82/0xf0
[  514.443491]   [<ffffffff8135a13e>] __const_udelay+0x2e/0x30
[  514.443495]   [<ffffffff810722d2>] native_safe_apic_wait_icr_idle+0x22/0x50
[  514.443500]   [<ffffffff81073483>] default_send_IPI_mask_sequence_phys+0x103/0x110
[  514.443506]   [<ffffffff81077607>] physflat_send_IPI_all+0x17/0x20
[  514.443510]   [<ffffffff8107363a>] arch_trigger_all_cpu_backtrace+0x5a/0x90
[  514.443514]   [<ffffffff811238bf>] __rcu_pending+0x37f/0x3e0
[  514.443519]   [<ffffffff81123f02>] rcu_check_callbacks+0x132/0x1b0
[  514.443523]   [<ffffffff810b7698>] update_process_times+0x48/0x90
[  514.443528]   [<ffffffff810dc900>] tick_sched_timer+0x60/0xc0
[  514.443534]   [<ffffffff810ce794>] __run_hrtimer+0x74/0x250
[  514.443537]   [<ffffffff810dc8a0>] ? tick_nohz_handler+0x100/0x100
[  514.443541]   [<ffffffff810cf273>] hrtimer_interrupt+0x103/0x230
[  514.443544]   [<ffffffff8165d026>] smp_apic_timer_interrupt+0x66/0x98
[  514.443549]   [<ffffffff8165b6f3>] apic_timer_interrupt+0x73/0x80
[  514.443554]  <EOI>
[  514.443555]   [<ffffffff813a55df>] ? intel_idle+0xdf/0x140
[  514.443559]   [<ffffffff813a55db>] ? intel_idle+0xdb/0x140
[  514.443563]   [<ffffffff8151a660>] cpuidle_idle_call+0xc0/0x240
[  514.443568]   [<ffffffff81054235>] cpu_idle+0xd5/0x140
[  514.443572]   [<ffffffff8162bb21>] rest_init+0xd5/0xe4
[  514.443575]   [<ffffffff8162ba4c>] ? csum_partial_copy_generic+0x16c/0x16c
[  514.443578]   [<ffffffff81cf4c36>] start_kernel+0x3e1/0x3ec
[  514.443583]   [<ffffffff81cf4347>] x86_64_start_reservations+0x132/0x136
[  514.443587]   [<ffffffff81cf4140>] ? early_idt_handlers+0x140/0x140
[  514.443590]   [<ffffffff81cf444d>] x86_64_start_kernel+0x102/0x111

(and after that all CPUs' stacks are inside cpuidle_idle_call)
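
For what it's worth, that trace is just the RCU stall detector asking
every CPU to dump its stack: arch_trigger_all_cpu_backtrace() sends
the backtrace IPI. If I understand it right, I can request the same
dump by hand from the console next time it happens:

    echo 1 > /proc/sys/kernel/sysrq  # allow all sysrq functions
    echo l > /proc/sysrq-trigger     # backtrace of all active CPUs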


So can somebody suggest how I can debug this problem further and
pinpoint the reason for such a freeze? Or maybe someone has ideas
about what the culprit is?
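
In case it helps with suggestions: the serial console stays attached,
so during the next lockup I can try sending SysRq over it (a serial
BREAK followed by the command key), roughly:

    # enable all sysrq functions while the machine is still healthy
    echo 1 > /proc/sys/kernel/sysrq
    # then over serial during the lockup: BREAK + 'w' (blocked tasks),
    # BREAK + 't' (all tasks), BREAK + 'm' (memory state)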


Thank you,
Pavel


View attachment "traces.txt" of type "text/plain" (278485 bytes)
