linux-kernel - Re: frequent lockups in 3.18rc4

Message-ID: <20141120150411.GD2542@lerouge>
Date:	Thu, 20 Nov 2014 16:04:13 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Dave Jones <davej@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Don Zickus <dzickus@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	the arch/x86 maintainers <x86@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: frequent lockups in 3.18rc4

On Wed, Nov 19, 2014 at 09:59:02AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
>  > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@...hat.com> wrote:
>  > >
>  > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
>  > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
>  > > RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
>  > > Call Trace:
>  > >  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
>  > >  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
>  > 
>  > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
>  > 
>  > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
>  > 
>  > That makes me wonder: does the problem go away if you disable NOHZ?
> 
> Apparently not.
> 
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS:  00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
>  00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
>  0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
>  [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>  [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80 
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18  EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS:  00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
>  0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
>  0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0

One thing that happens a lot in your crashes is a CPU sending IPIs, maybe
stuck polling on csd->lock or something. But it's not the CPU that soft
locks up. At least not the first one that gets reported.
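
To make the "stuck polling" idea concrete, here is a rough user-space
sketch of that kind of wait, assuming the sender spins on a lock bit in the
request until the target has run the callback and cleared it. This is only
a model built on pthreads and C11 atomics, not the kernel code: struct
csd_model, CSD_FLAG_LOCK, csd_lock_wait(), csd_unlock(), target_cpu() and
run_csd_model() are names made up for the illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical lock bit: set while a cross-CPU request is in flight. */
#define CSD_FLAG_LOCK 0x01u

/* Hypothetical stand-in for a per-request call descriptor. */
struct csd_model {
	atomic_uint flags;
	void (*func)(void *info);
	void *info;
};

/* Sender side: busy-wait until the target clears the lock bit. */
static void csd_lock_wait(struct csd_model *csd)
{
	while (atomic_load_explicit(&csd->flags, memory_order_acquire) &
	       CSD_FLAG_LOCK)
		; /* spins forever if the target never gets to csd_unlock() */
}

/* Target side: publish completion by dropping the lock bit. */
static void csd_unlock(struct csd_model *csd)
{
	atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK,
				  memory_order_release);
}

/* Thread standing in for the CPU that receives the IPI. */
static void *target_cpu(void *arg)
{
	struct csd_model *csd = arg;

	/* The request must still be locked while it is being handled. */
	assert(atomic_load(&csd->flags) & CSD_FLAG_LOCK);
	csd->func(csd->info);
	csd_unlock(csd);
	return NULL;
}

/* Callback run on the "remote CPU". */
static void bump(void *info)
{
	*(int *)info += 1;
}

/* Send one request and wait for it; returns the counter, or -1 on error. */
int run_csd_model(void)
{
	int counter = 0;
	struct csd_model csd = { .func = bump, .info = &counter };
	pthread_t t;

	atomic_init(&csd.flags, CSD_FLAG_LOCK);	/* lock before "sending" */
	if (pthread_create(&t, NULL, target_cpu, &csd) != 0)
		return -1;
	csd_lock_wait(&csd);			/* the polling loop in question */
	pthread_join(t, NULL);
	return counter;
}
```

Built with -pthread and called from a main(), run_csd_model() returns 1
once the target thread has run the callback. If the target never reaches
csd_unlock(), the sender stays in csd_lock_wait() indefinitely, which is
the sort of state a CPU sitting in generic_exec_single() would be
consistent with.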

>  [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
>  [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>  [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>  [<ffffffff941689c6>] perf_read+0x226/0x370
>  [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
>  [<ffffffff941eafff>] vfs_read+0x9f/0x180
>  [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
>  [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9