Message-ID: <532CA2A5.30102@st.ovgu.de>
Date: Fri, 21 Mar 2014 21:35:49 +0100
From: Matthias Graf <matthias.graf@...ovgu.de>
To: Borislav Petkov <bp@...en8.de>
CC: linux-kernel@...r.kernel.org, Tony Luck <tony.luck@...el.com>
Subject: Re: PROBLEM: Fatal Machine Check >= 3.13.5-101.fc19.x86_64
The attached log starts with the reboot into the failing kernel and ends
with the reboot back into the working kernel. Note the many NULs printed
after this log entry:
Mar 21 21:23:06 linux rsyslogd: [origin software
I will try booting without GNOME later (I do not have more time right now).
On 21.03.2014 21:13, Borislav Petkov wrote:
> + Tony.
>
> Provided the decode is correct and I'm reading it right, this looks
> like the cores get to livelock for some reason without any forward
> progress. The MCEs signal that there hasn't been any instruction retired
> in relatively long time, thus a stall.
>
> You say, this happens when gnome starts. Does it also happen if you
> don't start gnome, i.e. don't start X at all? Try booting into a
> runlevel without graphics.
>
> Tony, any other ideas?
>
> Also, can you send full dmesg of both a working boot, without the MCEs
> and one with?
>
> Leaving in the rest.
>
> On Fri, Mar 21, 2014 at 08:49:51PM +0100, Matthias Graf wrote:
>> (Please CC me on all replies)
>>
>> mcelog output for all mces:
>>
>>
>>
>> Hardware event. This is not a software error.
>> CPU 3 BANK 0
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
>> Request-did-not-timeout Error
>> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
>> timeout BINIT (ROB timeout). No micro-instruction retired for some time
>> STATUS b200004000000800 MCGSTATUS 5
>>
>>
>> Hardware event. This is not a software error.
>> CPU 3 BANK 5
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200220024080400 MCGSTATUS 5
>>
>>
>> Hardware event. This is not a software error.
>> CPU 1 BANK 0
>> MCG status:MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
>> Request-did-not-timeout Error
>> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
>> timeout BINIT (ROB timeout). No micro-instruction retired for some time
>> STATUS b200004000000800 MCGSTATUS 4
>>
>>
>> Hardware event. This is not a software error.
>> CPU 1 BANK 5
>> MCG status:MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200220010040400 MCGSTATUS 4
>>
>>
>> Hardware event. This is not a software error.
>> CPU 2 BANK 0
>> MCG status:MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
>> Request-did-not-timeout Error
>> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
>> timeout BINIT (ROB timeout). No micro-instruction retired for some time
>> STATUS b200004000000800 MCGSTATUS 4
>>
>>
>> Hardware event. This is not a software error.
>> CPU 2 BANK 5
>> MCG status:MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200221010040400 MCGSTATUS 4
>>
>> Hardware event. This is not a software error.
>> CPU 0 BANK 5
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: Internal Timer error
>> STATUS b200221024080400 MCGSTATUS 5
>>
>>
>> Hardware event. This is not a software error.
>> CPU 0 BANK 0
>> MCG status:RIPV MCIP
>> MCi status:
>> Uncorrected error
>> Error enabled
>> Processor context corrupt
>> MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access
>> Request-did-not-timeout Error
>> BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
>> timeout BINIT (ROB timeout). No micro-instruction retired for some time
>> STATUS b200004000000800 MCGSTATUS 5
>>
>>
>>
>> On 21.03.2014 18:27, Borislav Petkov wrote:
>>> On Fri, Mar 21, 2014 at 06:10:23PM +0100, Matthias Graf wrote:
>>>> Please CC me on replies.
>>>>
>>>> [1.] Kernel panic: Fatal Machine Check after booting >=
>>>> 3.13.5-101.fc19.x86_64; 3.12.11-201.fc19.x86_64 works fine!
>>>> [2.] Screen freezes a few seconds after Gnome appears. The error message
>>>> (see attachment) is only rarely still printed to the screen. Booting
>>>> 3.12.11-201 with otherwise the same setup, I do not see the panic.
>>>> Booting on different hardware (my laptop) does not produce the panic. I
>>>> also notice a low frame rate after gnome has started up, right before
>>>> the panic occurs. I therefore suppose this is graphics hardware related.
>>>> [3.] Fatal Machine Check Exception, RIP Inexact, apic_timer_interrupt,
>>>> Kernel panic
>>>> [4.] 3.13.6-100.fc19.x86_64 && 3.13.5-103.fc19.x86 && 3.13.5-101.fc19.x86_64
>>>> [5.] OCRed: (see attachment for photo)
>>>>
>>>> Started Accounts Service.
>>>> [ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 8: bZ88884888888888
>>>> [ 44.468168] mce: [Hardware Error]: HIP ?IHEXfiCT? 18:<ffffffff816881f8> {apicgtimer_interrupt+8x8/8x88}
>>>> I 44.468168] mce: [Hardware Error]: TSC 36S??8ad8c
>>>> f 44.468168] mce: [Hardware Error]: PROCESSOR 8:6fb TIM 138471666? SOCKET 8 HPIC 2 microcode ba
>>>> I 44.468168] mce: [Hardware Error]: Run the above through 'mcelog ~~ascii’
>>>
>>> This looks like you had some text recognition done on the jpeg. :-)
>>>
>>> Please correct the error message to be exactly as in the jpeg and run it
>>> through mcelog --ascii to see what that bank 8 is trying to tell us.
>>>
>>> Thanks.
>>>
>
>> [ 34.348483] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200004000000800
>> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
>> [ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: b200220024080400
>> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
>> [ 44.468168] mce: [Hardware Error]: TSC 365779ad0c
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 2 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 0: b200004000000800
>> [ 44.468168] mce: [Hardware Error]: TSC 365779ad42
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 1: Machine Check Exception: 4 Bank 5: b200220010040400
>> [ 44.468168] mce: [Hardware Error]: TSC 365779ad42
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 3 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 0: b200004000000800
>> [ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 5: b200221010040400
>> [ 44.468168] mce: [Hardware Error]: TSC 365779aeaa
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 1 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: b200221024080400
>> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
>> [ 44.468168] mce: [Hardware Error]: TSC 365779aece
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 0: b200004000000800
>> [ 44.468168] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff816901f0> {apic_timer_interrupt+0x0/0x80}
>> [ 44.468168] mce: [Hardware Error]: TSC 365779aece
>> [ 44.468168] mce: [Hardware Error]: PROCESSOR 0:6fb TIME 1394716667 SOCKET 0 APIC 0 microcode ba
>> [ 44.468168] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> [ 44.468168] mce: [Hardware Error]: Machine check: Processor context corrupt
>> [ 44.468168] Kernel panic - not syncing: Fatal Machine check
>> [ 44.468168] drm_kms_helper: panic occurred, switching back to text console
>> [ 44.468168] Rebooting in 30 seconds..
>
>
>
>
>
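As a cross-check on the mcelog decode quoted above, the STATUS and MCGSTATUS values can also be unpacked by hand. The following is only a minimal sketch, assuming the architectural bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS; the function and table names are made up for illustration and are not part of mcelog or the kernel, and the model-specific bits that mcelog reports (such as the ROB timeout text) are not interpreted here.

```python
# Sketch only: helper names are hypothetical, not mcelog or kernel APIs.
# Bit positions assume the architectural MCA register layout.

MCI_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

MCG_FLAGS = [
    (0, "RIPV"),    # restart IP valid
    (1, "EIPV"),    # error IP valid
    (2, "MCIP"),    # machine check in progress
]


def set_flags(value, table):
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_mci_status(status):
    """Split an MCi_STATUS value into flags and the two 16-bit error codes."""
    return {
        "flags": set_flags(status, MCI_FLAGS),
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16, model specific
    }


def describe_mca_code(code):
    """Classify only the two compound error codes seen in this thread."""
    if code == 0x0400:
        return "internal timer error"
    if code & 0xF800 == 0x0800:
        # Bus/interconnect pattern: 0000 1PPT RRRR IILL
        fields = {
            "PP": (code >> 9) & 0x3,    # participation (0 = local request)
            "T": (code >> 8) & 0x1,     # 0 = request did not time out
            "RRRR": (code >> 4) & 0xF,  # request type (0 = generic)
            "II": (code >> 2) & 0x3,    # 0 = memory access
            "LL": code & 0x3,           # cache level (0 = level 0)
        }
        return "bus/interconnect error %r" % (fields,)
    return "not classified by this sketch"


if __name__ == "__main__":
    # STATUS / MCGSTATUS pairs for CPU 3, copied from the report.
    for status, mcg in [(0xB200004000000800, 5), (0xB200220024080400, 5)]:
        decoded = decode_mci_status(status)
        print("%016x" % status, decoded["flags"],
              describe_mca_code(decoded["mca_code"]),
              "MCG:", set_flags(mcg, MCG_FLAGS))
```

For every STATUS value in the report the flag list comes out as VAL, UC, EN, PCC, which agrees with the "Uncorrected error / Error enabled / Processor context corrupt" lines from mcelog. MCGSTATUS 5 decodes to RIPV plus MCIP, and MCGSTATUS 4 to MCIP alone.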
View attachment "messagesRelevantPart.txt" of type "text/plain" (257092 bytes)
Download attachment "signature.asc" of type "application/pgp-signature" (539 bytes)