linux-kernel - [BUG] machine check Oops on Alpha

Message-ID: <20160417210532.GA27208@gherkin.frus.com>
Date:	Sun, 17 Apr 2016 16:05:32 -0500
From:	Bob Tracy <rct@...rkin.frus.com>
To:	linux-kernel@...r.kernel.org
Cc:	debian-alpha@...ts.debian.org, mcree@...on.net.nz,
	jay.estabrook@...il.com, mattst88@...il.com
Subject: [BUG] machine check Oops on Alpha

Apologies in advance for the "poor" quality of this bug report.  No idea
how to proceed, because the issue historically has been intermittent to
non-existent for reasons unknown.

Within 24 hours of booting my Alpha (PWS 433au), I'm pretty much
guaranteed to see a "machine check" Oops which typically will occur
during a period of high disk activity (for example, during an "apt-get
update / upgrade").  If I want a huge mess to clean up afterward, "git
pull" on the kernel source tree will generally suffice as well :-(.

As long as the "Oops" trace doesn't include evidence of filesystem write
activity (calls to ext3/4 functions), the machine is perfectly stable
afterward for as long as I care to let it run -- days, weeks, whatever
-- no further Oopses will occur, regardless of how hard I flog the
machine.  A "bad" Oops will cause an immediate system lockup if any
process attempts to access the region of disk that was active at the
time the Oops occurred.

While a "machine check" is normally indicative of an underlying hardware
issue, the fact that this is a one-time-per-boot issue has me thinking
otherwise.  I suspect a code path being traversed prior to the Oops that
gets bypassed afterward.  As previously mentioned, there have been months-
long intervals in the past where the issue has either been masked or non-
existent.  Currently, the issue has persisted through several 4.X kernel
release candidates and releases.

Attached is an example of precisely what I'm talking about as far as a
"good" Oops.  It occurred within a day of the last reboot, and the
machine has been running fine since.  Been flogging the devil out of it,
too: lots of updates (hundreds of megabytes), kernel builds, etc.

While any and all help tracking this down will be appreciated, please
know that kernel rebuilds (to turn on debugging or for whatever reason)
are an overnight affair on this system.  In other words, turnaround time
on diagnostic iterations involving kernel modifications will be slow.

--Bob

View attachment "good_oops" of type "text/plain" (3716 bytes)