linux-kernel - Re: futex funkiness -- massive lockups


Message-ID: <20140305090113.GE2705@gmail.com>
Date:	Wed, 5 Mar 2014 10:01:13 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Davidlohr Bueso <davidlohr@...com>
Cc:	tglx@...utronix.de, dvhart@...ux.intel.com, peterz@...radead.org,
	paulmck@...ux.vnet.ibm.com, torvalds@...ux-foundation.org,
	linux-kernel@...r.kernel.org
Subject: Re: futex funkiness -- massive lockups


* Davidlohr Bueso <davidlohr@...com> wrote:

> Hi,
> 
> A large number of lockups are seen on a 480 core system doing some sort
> of database-like workload. All except one are soft lockups. This is a
> SLES11 system with most of the recent futex changes backported,
> including commits 63b1a816, b0c29f79, 99b60ce6, a52b89eb, 0d00c7b2,
> 5cdec2d8 and f12d5bfc.
> 
> The following are some traces I put together in chronological order from
> the report I received. While the traces aren't perfect, I believe they
> exemplify the issue pretty well. There are a lot more, but they are just
> more of the same.
> 
> [212046.044098] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 22
> [212046.044098] Pid: 312554, comm: XXX Tainted: GF     D W  N  3.0.101-0.15-default #1
> [212046.044098] Call Trace:
> [212046.044098]  [<ffffffff81004935>] dump_trace+0x75/0x310
> [212046.044098]  [<ffffffff8145e0b3>] dump_stack+0x69/0x6f
> [212046.044098]  [<ffffffff8145e14c>] panic+0x93/0x201
> [212046.044098]  [<ffffffff810c65e4>] watchdog_overflow_callback+0xb4/0xc0
> [212046.044098]  [<ffffffff810f2d9a>] __perf_event_overflow+0xaa/0x230
> [212046.044098]  [<ffffffff81018210>] intel_pmu_handle_irq+0x1a0/0x330
> [212046.044098]  [<ffffffff81462ae1>] perf_event_nmi_handler+0x31/0xa0
> [212046.044098]  [<ffffffff81464c37>] notifier_call_chain+0x37/0x70
> [212046.044098]  [<ffffffff81464c7d>] __atomic_notifier_call_chain+0xd/0x20
> [212046.044098]  [<ffffffff81464ccd>] notify_die+0x2d/0x40
> [212046.044098]  [<ffffffff81462127>] default_do_nmi+0x37/0x200
> [212046.044098]  [<ffffffff81462358>] do_nmi+0x68/0x80
> [212046.044098]  [<ffffffff814618ad>] restart_nmi+0x1a/0x1e

Is this the end of the traceback, i.e. does the first anomalous lockup 
show that the NMI interrupted user-space mode? If yes then that's 
highly unusual.
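
One rough way to check this against a captured log is to list whatever 
frames the trace prints after do_nmi. The sketch below is only an 
illustration: frames_below() and the FRAME pattern are hypothetical 
helpers, not kernel tooling, and the sample lines are copied from the 
trace quoted above. A result holding nothing but the NMI entry stub 
only hints that no kernel frames of the interrupted context were 
printed; it does not prove what was interrupted.

```python
import re

# Matches one frame such as " [<ffffffff81462358>] do_nmi+0x68/0x80"
# and captures the symbol name.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def symbols(trace_lines):
    """Return the symbol names of all frames, in printed order."""
    found = []
    for text in trace_lines:
        match = FRAME.search(text)
        if match:
            found.append(match.group(1))
    return found


def frames_below(trace_lines, marker="do_nmi"):
    """Return the symbols printed after `marker`, or None if it is absent."""
    syms = symbols(trace_lines)
    if marker not in syms:
        return None
    return syms[syms.index(marker) + 1:]


sample = [
    "[212046.044098]  [<ffffffff81462127>] default_do_nmi+0x37/0x200",
    "[212046.044098]  [<ffffffff81462358>] do_nmi+0x68/0x80",
    "[212046.044098]  [<ffffffff814618ad>] restart_nmi+0x1a/0x1e",
]
print(frames_below(sample))
```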

The 'GF D W' taint also suggests that there was something going on 
before this triggered: 'W' suggests that something warned before, 'D' 
suggests something died anomalously before and 'F' suggests a forced 
or unsigned module.
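
As an illustration of reading the taint field, here is a minimal 
sketch that pulls the flags out of the "Tainted:" line quoted above. 
decode_taint() and the TAINT_FLAGS table are hypothetical names; the 
table covers only the letters discussed in this thread plus 'G' and 
'P', with meanings as given in the kernel's tainted-kernels 
documentation. Any other letter, such as the 'N' in this trace, is 
reported as not covered rather than guessed at.

```python
import re

# Only the flags discussed here; this is not the full kernel table.
TAINT_FLAGS = {
    "G": "no proprietary module loaded",
    "P": "proprietary module loaded",
    "F": "module was force-loaded",
    "D": "kernel died recently (oops or BUG)",
    "W": "a warning was issued earlier",
}


def decode_taint(line):
    """Return (flag, meaning) pairs for the 'Tainted:' field of a log line."""
    # The flags sit between "Tainted:" and the kernel version string.
    match = re.search(r"Tainted:\s+(.*?)\s+\d+\.\d+", line)
    if match is None:
        return []
    flags = match.group(1).replace(" ", "")
    return [(flag, TAINT_FLAGS.get(flag, "not in this table")) for flag in flags]


line = "Pid: 312554, comm: XXX Tainted: GF     D W  N  3.0.101-0.15-default #1"
for flag, meaning in decode_taint(line):
    print(flag, "-", meaning)
```

The order of the pairs follows the order in which the flags appear in 
the log line, so earlier warnings ('W') and deaths ('D') show up next 
to the module flags.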

So even the earliest traces look like aftereffects.

Thanks,

	Ingo