linux-kernel - RE: Kernel Panic with Rawtherapee (mce related)

Message-ID: <1331756331.5315.19.camel@erde.fritz.box>
Date:	Wed, 14 Mar 2012 21:18:51 +0100
From:	Adalbert Dawid <dawid@...ux.net>
To:	"Luck, Tony" <tony.luck@...el.com>
Cc:	Borislav Petkov <bp@...64.org>,
	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mingo@...e.hu" <mingo@...e.hu>, "x86@...nel.org" <x86@...nel.org>
Subject: RE: Kernel Panic with Rawtherapee (mce related)

Thank you for the quick reply.

On Wed, 2012-03-14 at 17:51 +0000, Luck, Tony wrote:
> > You're getting a bunch of machine checks, the last one of them being
> > fatal (Process Context Corrupt bit is set) causing the machine to panic.
> 
> PCC is set in all of them
> 
> > Tony will probably be able to help you further in decoding what exactly
> > those MC0_STATUS and MC5_STATUS values mean
> 
> Bank 5 ends in 0400 - which means "Internal timer error". Bank 0 has 0800
> which is a bus/interconnect error where this processor was the source of
> a memory transaction.
> 
> That's where the facts end - speculation begins here ...
> 
> Since this is repeatable under load - it's possible that a page table got
> corrupted and you are trying to access some non-existent memory location?
> Do all traces for this panic involve *_tlb_* functions?
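
For anyone following the decoding above, here is a minimal sketch of how
the low 16 bits (the MCA error code) and the PCC bit of an MCi_STATUS
value can be picked apart. It assumes the architectural layout from the
Intel SDM (PCC is bit 57; bus/interconnect codes have the form
0000 1PPT RRRR IILL). The function names and the example status words
are hypothetical: only the trailing 0800 and 0400 and the PCC flag come
from this thread, not the full register values from the screenshot.

```python
# Sketch of decoding architectural fields of an IA32_MCi_STATUS value.
# Bit positions follow the Intel SDM layout; names here are illustrative.
PCC_BIT = 57  # Processor Context Corrupt
UC_BIT = 61   # Uncorrected error
VAL_BIT = 63  # Register contents are valid

# PP field (bits 10:9) of a bus/interconnect error code
PARTICIPATION = {
    0b00: "local processor originated the request (SRC)",
    0b01: "local processor responded to the request (RES)",
    0b10: "local processor observed the error as a third party (OBS)",
    0b11: "generic",
}

# II field (bits 3:2) of a bus/interconnect error code
TRANSACTION = {
    0b00: "memory access",
    0b01: "reserved",
    0b10: "I/O",
    0b11: "other transaction",
}


def pcc_set(status):
    """Return True if the Processor Context Corrupt bit is set."""
    return bool((status >> PCC_BIT) & 1)


def decode_mca_error_code(status):
    """Describe the MCA error code held in bits 15:0 of a status word."""
    code = status & 0xFFFF
    # Assumption: bit 12 is a filtering flag on some CPUs, so ignore it.
    code &= ~0x1000
    if code == 0x0400:
        return "internal timer error"
    if (code & 0xF800) == 0x0800:  # 0000 1PPT RRRR IILL
        pp = (code >> 9) & 0x3
        timeout = (code >> 8) & 0x1
        ii = (code >> 2) & 0x3
        text = f"bus/interconnect error: {PARTICIPATION[pp]}, {TRANSACTION[ii]}"
        return text + (", request timed out" if timeout else "")
    return f"not decoded by this sketch (0x{code:04x})"


# Hypothetical status words: VAL, UC and PCC set, plus the reported codes.
FLAGS = (1 << VAL_BIT) | (1 << UC_BIT) | (1 << PCC_BIT)
bank0 = FLAGS | 0x0800
bank5 = FLAGS | 0x0400

if __name__ == "__main__":
    for name, status in (("bank 0", bank0), ("bank 5", bank5)):
        print(name, hex(status), decode_mca_error_code(status),
              "PCC" if pcc_set(status) else "")
```

The sketch only covers the two code families mentioned in this thread;
cache, TLB and memory-controller codes fall through to the last branch.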

Since the screenshot I posted is the only one I have been able to
capture, I don't know. I will try to provoke the crash by putting the
machine under load with rawtherapee and will post the results if I
succeed. Cpuburn did not manage to crash the machine in a (shortish)
test I ran a few days ago.

It would be very helpful to disable the "reboot in 30 seconds" timeout.
Is that possible somehow?

> Or perhaps you have a cooling problem - and when stressed your cpu or
> memory is getting too hot?

I do not think that is the cause: the CPU fan and two case fans are
running fine, and the sensors report CPU temperatures below 60°C, even
under load.

Up to now, it has always been rawtherapee that crashed the machine. This
is why I thought it might be some special CPU feature (an SSE
instruction or something similar) that happens to be broken in my CPU
and that is triggered only by rawtherapee and not by any other software.
What is your opinion on this theory?

> -Tony
> 
