Message-ID: <20150513131146.GB25652@krava.redhat.com>
Date: Wed, 13 May 2015 15:11:46 +0200
From: Jiri Olsa <jolsa@...hat.com>
To: lkml <linux-kernel@...r.kernel.org>
Cc: x86@...nel.org, Robert Richter <rric@...nel.org>,
Borislav Petkov <bp@...en8.de>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>
Subject: [BUG] hard lockup on AMD server
hi,
I'm constantly getting hard lockups on a single AMD server,
triggered just by a kernel build (make -j65).
I cannot reproduce this on any other server, so I think it might
be a HW issue. The server runs a vanilla v4.0 kernel and has 64 CPUs:
processor : 63
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD Eng Sample, ZS258045TGG54_34/25/20_2/16
stepping : 0
microcode : 0x6000803
cpu MHz : 2500.000
cache size : 2048 KB
physical id : 3
siblings : 16
config is attached
any thoughts? thanks,
jirka
---
[ 1623.620487] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
[ 1623.627810] CPU: 2 PID: 11376 Comm: cc1 Not tainted 4.0.0 #1
[ 1623.633541] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[ 1623.640315] 0000000000000000 00000000de4f9d50 ffff880131e45ac0 ffffffff81682917
[ 1623.647864] 0000000000000000 ffffffff818d80d8 ffff880131e45b40 ffffffff8167d8ca
[ 1623.655413] 0000000000000010 ffff880131e45b50 ffff880131e45af0 00000000de4f9d50
[ 1623.662951] Call Trace:
[ 1623.665416] <NMI> [<ffffffff81682917>] dump_stack+0x45/0x57
[ 1623.671308] [<ffffffff8167d8ca>] panic+0xd0/0x204
[ 1623.676168] [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[ 1623.682829] [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[ 1623.689407] [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[ 1623.695643] [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[ 1623.701578] [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[ 1623.707636] [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[ 1623.713890] [<ffffffff810191b0>] nmi_handle+0x90/0x130
[ 1623.719129] [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[ 1623.724759] [<ffffffff810198b8>] do_nmi+0x88/0xc0
[ 1623.729627] [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[ 1623.735189] [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.740757] [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.746326] [<ffffffff811ebb55>] ? try_charge+0x335/0x720
[ 1623.751859] <<EOE>> [<ffffffff811ec828>] mem_cgroup_try_charge+0x98/0x110
[ 1623.758934] [<ffffffff811acb5e>] handle_pte_fault+0xf6e/0x13c0
[ 1623.764921] [<ffffffffa0412904>] ? xfs_iunlock+0x94/0xf0 [xfs]
[ 1623.770880] [<ffffffff811ae324>] handle_mm_fault+0x234/0x4a0
[ 1623.776700] [<ffffffff810650d2>] __do_page_fault+0x182/0x430
[ 1623.782539] [<ffffffff810653b1>] do_page_fault+0x31/0x70
[ 1623.788014] [<ffffffff8168bdc8>] page_fault+0x28/0x30
[ 1624.978726] Shutting down cpus with NMI
[ 1624.997461] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 1625.008145] drm_kms_helper: panic occurred, switching back to text console
[ 1625.086742] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
---
[ 929.029459] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 23
[ 929.037400] CPU: 23 PID: 29070 Comm: cc1 Not tainted 4.0.0 #1
[ 929.043665] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[ 929.050937] 0000000000000000 00000000dc1b80e6 ffff8801320e5ac0 ffffffff81682917
[ 929.059139] 0000000000000000 ffffffff818d80d8 ffff8801320e5b40 ffffffff8167d8ca
[ 929.067241] 0000000000000010 ffff8801320e5b50 ffff8801320e5af0 00000000dc1b80e6
[ 929.075324] Call Trace:
[ 929.077972] <NMI> [<ffffffff81682917>] dump_stack+0x45/0x57
[ 929.084291] [<ffffffff8167d8ca>] panic+0xd0/0x204
[ 929.089566] [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[ 929.096563] [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[ 929.103673] [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[ 929.110289] [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[ 929.116738] [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[ 929.123188] [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[ 929.129779] [<ffffffff810191b0>] nmi_handle+0x90/0x130
[ 929.135369] [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[ 929.141441] [<ffffffff810198b8>] do_nmi+0x88/0xc0
[ 929.146671] [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[ 929.152680] [<ffffffff810e28dd>] ? run_timer_softirq+0x3d/0x350
[ 929.159193] [<ffffffff810e28dd>] ? run_timer_softirq+0x3d/0x350
[ 929.165743] [<ffffffff810e28dd>] ? run_timer_softirq+0x3d/0x350
[ 929.172162] <<EOE>> <IRQ> [<ffffffff8101e309>] ? read_tsc+0x9/0x10
[ 929.179353] [<ffffffff810ea40e>] ? ktime_get+0x3e/0xa0
[ 929.185004] [<ffffffff8104cc3d>] ? lapic_next_event+0x1d/0x30
[ 929.191261] [<ffffffff8107e2b4>] __do_softirq+0xf4/0x2d0
[ 929.197114] [<ffffffff8107e795>] irq_exit+0x125/0x130
[ 929.202679] [<ffffffff8168cc4a>] smp_apic_timer_interrupt+0x4a/0x60
[ 929.209551] [<ffffffff8168acad>] apic_timer_interrupt+0x6d/0x80
[ 929.216027] <EOI>
[ 930.312877] Shutting down cpus with NMI
[ 930.327932] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 930.338247] drm_kms_helper: panic occurred, switching back to text console
[ 930.562120] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 23
---
md-drachma-02 login: [ 251.6835] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
[ 251.675650] CPU: 7 PID: 10522 Comm: cc1 Not tainted 4.0.0 #1
[ 251.681457] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[ 251.688239] 0000000000000000 0000000079fa379f ffff880131ee5ac0 ffffffff81682917
[ 251.695769] 0000000000000000 ffffffff818d80d8 ffff880131ee5b40 ffffffff8167d8ca
[ 251.703343] 0000000000000010 ffff880131ee5b50 ffff880131ee5af0 0000000079fa379f
[ 251.710946] Call Trace:
[ 251.713461] <NMI> [<ffffffff81682917>] dump_stack+0x45/0x57
[ 251.719293] [<ffffffff8167d8ca>] panic+0xd0/0x204
[ 251.724154] [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[ 251.730761] [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[ 251.737427] [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[ 251.743656] [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[ 251.749650] [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[ 251.755796] [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[ 251.762012] [<ffffffff810191b0>] nmi_handle+0x90/0x130
[ 251.767296] [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[ 251.772899] [<ffffffff810198b8>] do_nmi+0x88/0xc0
[ 251.777725] [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[ 251.783286] [<ffffffff8168ac4d>] ? apic_timer_interrupt+0xd/0x80
[ 251.789419] [<ffffffff8168ac4d>] ? apic_timer_interrupt+0xd/0x80
[ 251.795632] [<ffffffff8168ac4d>] ? apic_timer_interrupt+0xd/0x80
[ 251.801805] <<EOE>>
[ 271.308688] Shutting down cpus with NMI
[ 271.323729] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 271.333944] drm_kms_helper: panic occurred, switching back to text console
[ 271.340860] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 7
---
[69064.210932] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 19
[69064.218845] CPU: 19 PID: 45089 Comm: genksyms Not tainted 4.0.0 #1
[69064.225587] Hardware name: AMD DRACHMA/DRACHMA, BIOS RDP1270B 04/24/2012
[69064.232880] 0000000000000000 000000008c01e56e ffff880132065ac0 ffffffff81682917
[69064.241021] 0000000000000000 ffffffff818d80d8 ffff880132065b40 ffffffff8167d8ca
[69064.249120] 0000000000000010 ffff880132065b50 ffff880132065af0 000000008c01e56e
[69064.257335] Call Trace:
[69064.260024] <NMI> [<ffffffff81682917>] dump_stack+0x45/0x57
[69064.266476] [<ffffffff8167d8ca>] panic+0xd0/0x204
[69064.271707] [<ffffffff811318d0>] ? restart_watchdog_hrtimer+0x60/0x60
[69064.278807] [<ffffffff8113198a>] watchdog_overflow_callback+0xba/0xc0
[69064.285776] [<ffffffff8117308c>] __perf_event_overflow+0x9c/0x250
[69064.292555] [<ffffffff81173b74>] perf_event_overflow+0x14/0x20
[69064.299049] [<ffffffff8102d7b5>] x86_pmu_handle_irq+0x135/0x190
[69064.305635] [<ffffffff8102be5b>] perf_event_nmi_handler+0x2b/0x50
[69064.312445] [<ffffffff810191b0>] nmi_handle+0x90/0x130
[69064.318180] [<ffffffff8101973a>] default_do_nmi+0x4a/0x140
[69064.324240] [<ffffffff810198b8>] do_nmi+0x88/0xc0
[69064.329464] [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e
[69064.335442] [<ffffffff8131f7c7>] ? clear_page_c+0x7/0x10
[69064.341387] [<ffffffff8131f7c7>] ? clear_page_c+0x7/0x10
[69064.347257] [<ffffffff8131f7c7>] ? clear_page_c+0x7/0x10
[69064.353021] <<EOE>> [<ffffffff81185bea>] ? get_page_from_freelist+0x4ca/0xa10
[69064.361134] [<ffffffff811862ca>] __alloc_pages_nodemask+0x19a/0x9e0
[69064.367985] [<ffffffff810ad4e9>] ? pick_next_entity+0xa9/0x190
[69064.374413] [<ffffffff811cf2df>] alloc_pages_vma+0xaf/0x200
[69064.380621] [<ffffffff811acb36>] handle_pte_fault+0xf46/0x13c0
[69064.387083] [<ffffffff811ff816>] ? pipe_read+0x286/0x2f0
[69064.392922] [<ffffffff811ae324>] handle_mm_fault+0x234/0x4a0
[69064.399156] [<ffffffff810650d2>] __do_page_fault+0x182/0x430
[69064.405305] [<ffffffff810653b1>] do_page_fault+0x31/0x70
[69064.411281] [<ffffffff8168bdc8>] page_fault+0x28/0x30
[69065.511621] Shutting down cpus with NMI
[69065.526472] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[69065.536771] drm_kms_helper: panic occurred, switching back to text console
[69065.760674] ---[ end Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 19
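---
To compare the traces above side by side, a minimal sketch like the following could pull out, for each panic, the CPU, the task, and the first frame listed after end_repeat_nmi (that is, the code the NMI most likely interrupted). This is not part of the original report: the function name summarize_panics and both regular expressions are assumptions that only fit the v4.0-style trace layout shown here.

```python
import re

# Assumed patterns for the trace layout shown in this report.
CPU_RE = re.compile(r"CPU:\s+(\d+)\s+PID:\s+(\d+)\s+Comm:\s+(\S+)")
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_panics(log_text):
    """Return one dict per panic: cpu, pid, comm, interrupted symbol."""
    results = []
    current = None
    after_nmi_entry = False
    for line in log_text.splitlines():
        header = CPU_RE.search(line)
        if header:
            # A "CPU: N PID: N Comm: name" line starts a new panic record.
            current = {
                "cpu": int(header.group(1)),
                "pid": int(header.group(2)),
                "comm": header.group(3),
                "interrupted": None,
            }
            results.append(current)
            after_nmi_entry = False
            continue
        frame = FRAME_RE.search(line)
        if frame is None or current is None:
            continue
        if frame.group(1) == "end_repeat_nmi":
            # Frames below this one belong to the interrupted context.
            after_nmi_entry = True
        elif after_nmi_entry and current["interrupted"] is None:
            current["interrupted"] = frame.group(1)
    return results


if __name__ == "__main__":
    sample = (
        "[ 1623.627810] CPU: 2 PID: 11376 Comm: cc1 Not tainted 4.0.0 #1\n"
        "[ 1623.729627] [<ffffffff8168c101>] end_repeat_nmi+0x1e/0x2e\n"
        "[ 1623.735189] [<ffffffff811ebb55>] ? try_charge+0x335/0x720\n"
    )
    for entry in summarize_panics(sample):
        print(entry)
```

Read by hand, the four traces show a different interrupted symbol each time (try_charge, run_timer_softirq, apic_timer_interrupt, clear_page_c), so no single code path stands out as the place where the CPUs get stuck.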
View attachment "config" of type "text/plain" (134699 bytes)