linux-kernel - dying hdd causing MCE and panic (libata)

Date:	Sun, 20 Apr 2008 11:33:09 +0200
From:	Rumi Szabolcs <rumi_ml@...m.hu>
To:	linux-kernel@...r.kernel.org
Subject: dying hdd causing MCE and panic (libata)

Hello all!

A SATA drive in one of my servers has taken its final steps towards
the grave. It put some obvious signs of this on the console (ATA
transactions failing), but then it also triggered an MCE (CPU context
corrupt), after which the kernel panicked.
This server is otherwise rock stable and used to reach uptimes
measured in months between planned restarts.

The machine was completely disconnected from power and restarted
multiple times, but it always crashed during the boot process with
an MCE, a panic, or both.

Sorry, but I cannot provide exact debug information right now because
I wasn't physically there at the time and I'm still 250 km away from
that server. In fact, I remotely guided two people without a clue
over the phone; they read things from the console for me,
restarted the machine, etc.

So in the end I told them to open up the server and pull the SATA
cable from that particular drive. All the MCEs and panics went away
immediately, and the machine has been running fine since then.

Hardware:

- nForce4-based motherboard (chipset-integrated SATA ports)
- Athlon 64 single-core CPU
- DiamondMax 9 SATA hard drive

Kernel:

2.6.23-gentoo-r3 (no preempt, no smp)

My questions:

- Is it normal for a simple hard disk failure (on a disk that is
not even the system disk) to cause MCEs and kernel panics?

- Is this a problem caused entirely at the hardware level
(e.g. the southbridge going crazy and making the whole
hardware platform unstable), or is it something that could be
fixed or handled properly at the software (kernel) level?

Thanks!

Best regards,
Sab