Message-ID: <20141120150757.GE2542@lerouge>
Date: Thu, 20 Nov 2014 16:08:00 +0100
From: Frederic Weisbecker <fweisbec@...il.com>
To: Dave Jones <davej@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Linux Kernel <linux-kernel@...r.kernel.org>,
the arch/x86 maintainers <x86@...nel.org>
Subject: Re: frequent lockups in 3.18rc4
On Mon, Nov 17, 2014 at 12:03:59PM -0500, Dave Jones wrote:
> On Sat, Nov 15, 2014 at 10:33:19PM -0800, Linus Torvalds wrote:
>
> > > > I'll try that next, and check in on it tomorrow.
> > >
> > > No luck. Died even faster this time.
> >
> > Yeah, and your other lockups haven't even been TLB related. Not that
> > they look like anything else *either*.
> >
> > I have no ideas left. I'd go for a bisection - rather than try random
> > things, at least bisection will get us a smaller set of suspects if
> > you can go through a few cycles of it. Even if you decide that you
> > want to run for most of a day before you are convinced it's all good,
> > a couple of days should get you a handful of bisection points (that's
> > assuming you hit a couple of bad ones too that turn bad in a shorter
> > while). And four or five bisections should get us from 11k commits down
> > to the ~600 commit range. That would be a huge improvement.
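
(Quick sanity check on those numbers: each good/bad verdict roughly
halves the suspect range, so after k rounds about 11000 / 2^k commits
remain; 2^4 = 16 leaves ~690 and 2^5 = 32 leaves ~340, which brackets
the ~600 figure above.)
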
>
> Great start to the week: I decided to confirm my recollection that .17
> was ok, only to hit this within 10 minutes.
>
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> CPU: 3 PID: 17176 Comm: trinity-c95 Not tainted 3.17.0+ #87
> 0000000000000000 00000000f3a61725 ffff880244606bf0 ffffffff9583e9fa
> ffffffff95c67918 ffff880244606c78 ffffffff9583bcc0 0000000000000010
> ffff880244606c88 ffff880244606c20 00000000f3a61725 0000000000000000
> Call Trace:
> <NMI> [<ffffffff9583e9fa>] dump_stack+0x4e/0x7a
> [<ffffffff9583bcc0>] panic+0xd4/0x207
> [<ffffffff95150908>] watchdog_overflow_callback+0x118/0x120
> [<ffffffff95193dbe>] __perf_event_overflow+0xae/0x340
> [<ffffffff95192230>] ? perf_event_task_disable+0xa0/0xa0
> [<ffffffff9501a7bf>] ? x86_perf_event_set_period+0xbf/0x150
> [<ffffffff95194be4>] perf_event_overflow+0x14/0x20
> [<ffffffff95020676>] intel_pmu_handle_irq+0x206/0x410
> [<ffffffff9501966b>] perf_event_nmi_handler+0x2b/0x50
> [<ffffffff95007bb2>] nmi_handle+0xd2/0x390
> [<ffffffff95007ae5>] ? nmi_handle+0x5/0x390
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff950080a2>] default_do_nmi+0x72/0x1c0
> [<ffffffff950082a8>] do_nmi+0xb8/0x100
> [<ffffffff9584b9aa>] end_repeat_nmi+0x1e/0x2e
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> [<ffffffff958489b0>] ? _raw_spin_lock_irqsave+0x80/0x90
> <<EOE>> <IRQ> [<ffffffff95101685>] lock_hrtimer_base.isra.18+0x25/0x50
> [<ffffffff951019d3>] hrtimer_try_to_cancel+0x33/0x1f0
Ah, that one got fixed in the merge window and in -stable, right?
(More below the trace on what I think was going on here.)
> [<ffffffff95101baa>] hrtimer_cancel+0x1a/0x30
> [<ffffffff95113557>] tick_nohz_restart+0x17/0x90
> [<ffffffff95114533>] __tick_nohz_full_check+0xc3/0x100
> [<ffffffff9511457e>] nohz_full_kick_work_func+0xe/0x10
> [<ffffffff95188894>] irq_work_run_list+0x44/0x70
> [<ffffffff951888ea>] irq_work_run+0x2a/0x50
> [<ffffffff9510109b>] update_process_times+0x5b/0x70
> [<ffffffff95113325>] tick_sched_handle.isra.20+0x25/0x60
> [<ffffffff95113801>] tick_sched_timer+0x41/0x60
> [<ffffffff95102281>] __run_hrtimer+0x81/0x480
> [<ffffffff951137c0>] ? tick_sched_do_timer+0xb0/0xb0
> [<ffffffff95102977>] hrtimer_interrupt+0x117/0x270
> [<ffffffff950346d7>] local_apic_timer_interrupt+0x37/0x60
> [<ffffffff9584c44f>] smp_apic_timer_interrupt+0x3f/0x50
> [<ffffffff9584a86f>] apic_timer_interrupt+0x6f/0x80
> <EOI> [<ffffffff950d3f3a>] ? lock_release_holdtime.part.28+0x9a/0x160
> [<ffffffff950ef3b7>] ? rcu_is_watching+0x27/0x60
> [<ffffffff9508cb75>] kill_pid_info+0xf5/0x130
> [<ffffffff9508ca85>] ? kill_pid_info+0x5/0x130
> [<ffffffff9508ccd3>] SYSC_kill+0x103/0x330
> [<ffffffff9508cc7c>] ? SYSC_kill+0xac/0x330
> [<ffffffff9519b592>] ? context_tracking_user_exit+0x52/0x1a0
> [<ffffffff950d6f1d>] ? trace_hardirqs_on_caller+0x16d/0x210
> [<ffffffff950d6fcd>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff950137ad>] ? syscall_trace_enter+0x14d/0x330
> [<ffffffff9508f44e>] SyS_kill+0xe/0x10
> [<ffffffff95849b24>] tracesys+0xdd/0xe2
> Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
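
For reference, the panic at the top is the perf-NMI hard-lockup detector
firing. A minimal sketch of the check, modeled on kernel/watchdog.c with
the setup and locking details elided: a per-CPU hrtimer bumps a heartbeat
counter from normal interrupt context, and the perf NMI (which still gets
delivered while IRQs are disabled) panics if the counter hasn't moved
between two NMIs.

static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);

/* Called from the per-CPU watchdog hrtimer, i.e. only if IRQs work. */
static void watchdog_interrupt_count(void)
{
        __this_cpu_inc(hrtimer_interrupts);
}

/* Called from the perf NMI handler seen in the trace above. */
static bool is_hardlockup(void)
{
        unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

        /* No timer interrupt since the last NMI: IRQs stuck off. */
        if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
                return true;

        __this_cpu_write(hrtimer_interrupts_saved, hrint);
        return false;
}
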
>
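
The <IRQ> half of the trace also shows why those interrupts never came
back: the nohz_full irq_work ends up in hrtimer_cancel() on the very
hrtimer whose callback we are currently inside (tick_sched_timer ->
update_process_times -> irq_work -> tick_nohz_restart -> hrtimer_cancel).
hrtimer_cancel() spins until the callback returns, and the callback is
waiting on hrtimer_cancel(), so nothing ever makes progress. A minimal
standalone sketch of that self-cancel pattern, using a hypothetical
demo_timer rather than the real tick code:

#include <linux/hrtimer.h>

static struct hrtimer demo_timer;       /* hypothetical example timer */

static enum hrtimer_restart demo_timer_fn(struct hrtimer *t)
{
        /*
         * DEADLOCK: hrtimer_cancel() waits for the running callback to
         * finish, but the running callback is us, and we only return
         * once hrtimer_cancel() does. With IRQs off, only NMIs punch
         * through, which is why the hard-lockup detector is what fires.
         */
        hrtimer_cancel(&demo_timer);
        return HRTIMER_NORESTART;
}
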
> It could be a completely different cause of lockup, but seeing this now
> has me wondering whether it's something unrelated to the kernel.
> I recall running late .17-rc's for days without incident, and I'm
> pretty sure .17 was ok too. But a few weeks ago I upgraded that test
> box to the Fedora 21 beta, which means I have a new gcc. I'm not sure
> I really trust 4.9.1 yet, so maybe I'll see if I can get 4.8 back on
> there and check whether that's any better.
>
> Dave
>