Message-ID: <20150513131146.GB25652@krava.redhat.com>
Date:	Wed, 13 May 2015 15:11:46 +0200
From:	Jiri Olsa <jolsa@...hat.com>
To:	lkml <linux-kernel@...r.kernel.org>
Cc:	x86@...nel.org, Robert Richter <rric@...nel.org>,
	Borislav Petkov <bp@...en8.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: [BUG] hard lockup on AMD server

hi,
I'm constantly getting hard lockups on a single AMD server,
triggered just by a kernel build (make -j65).

I cannot reproduce this on any other server, so I think it might
be a HW issue. It's a vanilla v4.0 kernel on a server with 64 CPUs:

processor       : 63
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 2
model name      : AMD Eng Sample, ZS258045TGG54_34/25/20_2/16    
stepping        : 0
microcode       : 0x6000803
cpu MHz         : 2500.000
cache size      : 2048 KB
physical id     : 3
siblings        : 16

config is attached

any thoughts? thanks,
jirka


---
[ 1623.620487] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[ 1623.627810] CPU: 2 PID: 11376 Comm: cc1 Not tainted 4.0.0 #1
[ 1623.633541] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[ 1623.640315]  0000000000000000 00000000de4f9d50 ffff880131e45ac0 ffffffff81682917
[ 1623.647864]  0000000000000000 ffffffff818d80d8 ffff880131e45b40 ffffffff8167d8ca
[ 1623.655413]  0000000000000010 ffff880131e45b50 ffff880131e45af0 00000000de4f9d50
[ 1623.662951] Call Trace:
[ 1623.665416]  <NMI>  [<ffffffff81682917>] dump_stack+0x45/0x57
[ 1623.671308]  [<ffffffff8167d8ca>] panic+0xd0/0x204
[ 1623.676168]  [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[ 1623.682829]  [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[ 1623.689407]  [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[ 1623.695643]  [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[ 1623.701578]  [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[ 1623.707636]  [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[ 1623.713890]  [<ffffffff810191b0>] nmi_handle+0x90/0x130
[ 1623.719129]  [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[ 1623.724759]  [<ffffffff810198b8>] do_nmi+0x88/0xc0
[ 1623.729627]  [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[ 1623.735189]  [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.740757]  [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.746326]  [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.751859]  <<EOE>>  [<ffffffff811ec828>] mem_cgroup_try_charge+0x98/0x110
[ 1623.758934]  [<ffffffff811acb5e>] handle_pte_fault+0xf6e/0x13c0
[ 1623.764921]  [<ffffffffa0412904>] ? xfs_iunlock+0x94/0xf0 [xfs]
[ 1623.770880]  [<ffffffff811ae324>] handle_mm_fault+0x234/0x4a0
[ 1623.776700]  [<ffffffff810650d2>] __do_page_fault+0x182/0x430
[ 1623.782539]  [<ffffffff810653b1>] do_page_fault+0x31/0x70
[ 1623.788014]  [<ffffffff8168bdc8>] page_fault+0x28/0x30
[ 1624.978726] Shutting down cpus with NMI
[ 1624.997461] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 1625.008145] drm_kms_helper: panic occurred, switching back to text console
[ 1625.086742] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2


---


[  929.029459] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 23
[  929.037400] CPU: 23 PID: 29070 Comm: cc1 Not tainted 4.0.0 #1
[  929.043665] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[  929.050937]  0000000000000000 00000000dc1b80e6 ffff8801320e5ac0 ffffffff81682917
[  929.059139]  0000000000000000 ffffffff818d80d8 ffff8801320e5b40 ffffffff8167d8ca
[  929.067241]  0000000000000010 ffff8801320e5b50 ffff8801320e5af0 00000000dc1b80e6
[  929.075324] Call Trace:
[  929.077972]  <NMI>  [<ffffffff81682917>] dump_stack+0x45/0x57
[  929.084291]  [<ffffffff8167d8ca>] panic+0xd0/0x204
[  929.089566]  [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[  929.096563]  [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[  929.103673]  [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[  929.110289]  [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[  929.116738]  [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[  929.123188]  [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[  929.129779]  [<ffffffff810191b0>] nmi_handle+0x90/0x130
[  929.135369]  [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[  929.141441]  [<ffffffff810198b8>] do_nmi+0x88/0xc0
[  929.146671]  [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[  929.152680]  [<ffffffff810e28dd>] ? run_timer_softirq+0x3d/0x350
[  929.159193]  [<ffffffff810e28dd>] ? run_timer_softirq+0x3d/0x350
[  929.165743]  [<ffffffff810e28dd>] ? run_timer_softirq+0x3d/0x350
[  929.172162]  <<EOE>>  <IRQ>  [<ffffffff8101e309>] ? read_tsc+0x9/0x10
[  929.179353]  [<ffffffff810ea40e>] ? ktime_get+0x3e/0xa0
[  929.185004]  [<ffffffff8104cc3d>] ? lapic_next_event+0x1d/0x30
[  929.191261]  [<ffffffff8107e2b4>] __do_softirq+0xf4/0x2d0
[  929.197114]  [<ffffffff8107e795>] irq_exit+0x125/0x130
[  929.202679]  [<ffffffff8168cc4a>] smp_apic_timer_interrupt+0x4a/0x60
[  929.209551]  [<ffffffff8168acad>] apic_timer_interrupt+0x6d/0x80
[  929.216027]  <EOI> 
[  930.312877] Shutting down cpus with NMI
[  930.327932] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  930.338247] drm_kms_helper: panic occurred, switching back to text console
[  930.562120] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 23


---

md-drachma-02 login: [  251.6835] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
[  251.675650] CPU: 7 PID: 10522 Comm: cc1 Not tainted 4.0.0 #1
[  251.681457] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[  251.688239]  0000000000000000 0000000079fa379f ffff880131ee5ac0 ffffffff81682917
[  251.695769]  0000000000000000 ffffffff818d80d8 ffff880131ee5b40 ffffffff8167d8ca
[  251.703343]  0000000000000010 ffff880131ee5b50 ffff880131ee5af0 0000000079fa379f
[  251.710946] Call Trace:
[  251.713461]  <NMI>  [<ffffffff81682917>] dump_stack+0x45/0x57
[  251.719293]  [<ffffffff8167d8ca>] panic+0xd0/0x204
[  251.724154]  [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[  251.730761]  [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[  251.737427]  [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[  251.743656]  [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[  251.749650]  [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[  251.755796]  [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[  251.762012]  [<ffffffff810191b0>] nmi_handle+0x90/0x130
[  251.767296]  [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[  251.772899]  [<ffffffff810198b8>] do_nmi+0x88/0xc0
[  251.777725]  [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[  251.783286]  [<ffffffff8168ac4d>] ? apic_timer_interrupt+0xd/0x80
[  251.789419]  [<ffffffff8168ac4d>] ? apic_timer_interrupt+0xd/0x80
[  251.795632]  [<ffffffff8168ac4d>] ? apic_timer_interrupt+0xd/0x80
[  251.801805]  <<EOE>> 
[  271.308688] Shutting down cpus with NMI
[  271.323729] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  271.333944] drm_kms_helper: panic occurred, switching back to text console
[  271.340860] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7


---


[69064.210932] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 19
[69064.218845] CPU: 19 PID: 45089 Comm: genksyms Not tainted 4.0.0 #1
[69064.225587] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[69064.232880]  0000000000000000 000000008c01e56e ffff880132065ac0 ffffffff81682917
[69064.241021]  0000000000000000 ffffffff818d80d8 ffff880132065b40 ffffffff8167d8ca
[69064.249120]  0000000000000010 ffff880132065b50 ffff880132065af0 000000008c01e56e
[69064.257335] Call Trace:
[69064.260024]  <NMI>  [<ffffffff81682917>] dump_stack+0x45/0x57
[69064.266476]  [<ffffffff8167d8ca>] panic+0xd0/0x204
[69064.271707]  [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[69064.278807]  [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[69064.285776]  [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[69064.292555]  [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[69064.299049]  [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[69064.305635]  [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[69064.312445]  [<ffffffff810191b0>] nmi_handle+0x90/0x130
[69064.318180]  [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[69064.324240]  [<ffffffff810198b8>] do_nmi+0x88/0xc0
[69064.329464]  [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[69064.335442]  [<ffffffff8131f7c7>] ? clear_page_c+0x7/0x10
[69064.341387]  [<ffffffff8131f7c7>] ? clear_page_c+0x7/0x10
[69064.347257]  [<ffffffff8131f7c7>] ? clear_page_c+0x7/0x10
[69064.353021]  <<EOE>>  [<ffffffff81185bea>] ? get_page_from_freelist+0x4ca/0xa10
[69064.361134]  [<ffffffff811862ca>] __alloc_pages_nodemask+0x19a/0x9e0
[69064.367985]  [<ffffffff810ad4e9>] ? pick_next_entity+0xa9/0x190
[69064.374413]  [<ffffffff811cf2df>] alloc_pages_vma+0xaf/0x200
[69064.380621]  [<ffffffff811acb36>] handle_pte_fault+0xf46/0x13c0
[69064.387083]  [<ffffffff811ff816>] ? pipe_read+0x286/0x2f0
[69064.392922]  [<ffffffff811ae324>] handle_mm_fault+0x234/0x4a0
[69064.399156]  [<ffffffff810650d2>] __do_page_fault+0x182/0x430
[69064.405305]  [<ffffffff810653b1>] do_page_fault+0x31/0x70
[69064.411281]  [<ffffffff8168bdc8>] page_fault+0x28/0x30
[69065.511621] Shutting down cpus with NMI
[69065.526472] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[69065.536771] drm_kms_helper: panic occurred, switching back to text console
[69065.760674] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 19
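
In case it helps anyone compare the traces, here is a minimal sketch of how they could be summarized. It is not part of the kernel or of any existing tool; the function name summarize_lockups and the regexes are made up for illustration. It keys on the "CPU: N PID: P Comm: X" line (the first line of each panic can be garbled by the serial console) and takes the first frame listed after end_repeat_nmi as a guess at where the CPU was when the watchdog NMI fired. Those frames are marked "?" in the traces, so this is only a hint.

```python
import re

# "CPU: 2 PID: 11376 Comm: cc1 Not tainted 4.0.0 #1"
HEADER_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# "[<ffffffff811ebb55>] ? try_charge+0x335/0x720" (the "? " is optional)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?:\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockups(log_text):
    """Return one dict per panic report found in log_text.

    'interrupted' is the first stack frame printed after end_repeat_nmi,
    or None if the trace has no such frame.
    """
    reports = []
    want_frame = False
    for line in log_text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            # A new panic report starts here.
            reports.append({
                "cpu": int(header.group(1)),
                "pid": int(header.group(2)),
                "comm": header.group(3),
                "interrupted": None,
            })
            want_frame = False
            continue
        frame = FRAME_RE.search(line)
        if not frame or not reports:
            continue
        if frame.group(1) == "end_repeat_nmi":
            # The next frame belongs to the context the NMI interrupted.
            want_frame = True
        elif want_frame:
            reports[-1]["interrupted"] = frame.group(1)
            want_frame = False
    return reports


# Shortened excerpt of the first trace in this mail.
sample = """\
[ 1623.627810] CPU: 2 PID: 11376 Comm: cc1 Not tainted 4.0.0 #1
[ 1623.724759]  [<ffffffff810198b8>] do_nmi+0x88/0xc0
[ 1623.729627]  [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[ 1623.735189]  [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.740757]  [<ffffffff811ebb55>] ? try_charge+0x335/0x720
"""

for report in summarize_lockups(sample):
    print(report)
```

Read this way, the four traces stop in four different places (try_charge, run_timer_softirq, apic_timer_interrupt, clear_page_c), on four different CPUs.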

View attachment "config" of type "text/plain" (134699 bytes)
