lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Date:	Sun, 29 Jul 2007 08:04:17 -0700 (PDT)
From:	"Hendrik ." <chasake@...oo.com>
To:	linux-kernel@...r.kernel.org
Subject: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

Last night I discovered a problem in my RAID5 array
and finally after a lot of tests I narrowed it down to
a bad sector on one of the hard disks and some goofy
kernels.

I just yesterday build a new PC using an existing
array of 5 disks in RAID 5. I did build the array with
only 4 out of 5 disks in the system but the rebuild
processes stopped over and over again apparently at
the same position. At last I found out that the
harddisk at the first SATA port had developed some bad
sectors which made the kernel stop completely when it
tried to read that sector with the following error on
the screen:

HARDWARE ERROR
CPU 0: Machine Check Exception: 4  Bank 4:
b200000000070f0f
TSC b7d4a144d0
This is not a software problem!
Run through mcelog --ascii to decode and contact your
hardware vendor
Kernel panic - not syncing: Machine check

Googling around made me check memory, upgrade the BIOS
and things like that but now i DO think that this IS a
software problem, which is in the linux kernel.

I was running the standard 2.6.20-16 kernel series
from Ubuntu Feisty Fawn (using the generic and server
built) and I built my own 2.6.22.1 but the problem
still persisted. When copying manually with dd_rescue
I was not able to copy past the bad sector or the MCE
error reappeared. Only when using the standard Ubuntu
Edgy Eft kernel (2.6.17-12-server) the problem went
away completely and the syslog was filled with normal
lines like: 

Jul 28 22:58:26 mediaserver kernel: [ 6562.446868]
ata2: error=0x40 { UncorrectableError }
Jul 28 22:58:26 mediaserver kernel: [ 6562.446875] sd
1:0:0:0: SCSI error: return code = 0x8000002
Jul 28 22:58:26 mediaserver kernel: [ 6562.446880]    
Additional sense: Unrecovered read error - auto
reallocate failed
Jul 28 22:58:26 mediaserver kernel: [ 6562.446887]
end_request: I/O error, dev sda, sector 205534870

So in the end I was able to copy my stuff off the bad
harddisk to a new disk (losing some bytes because of
my already dirty RAID5 array) but I do think this is a
kernel bug or at least strange behavior as an old
kernel is willing to continue operation on something
'minor' as a bad sector. In the end when I will start
scrubbing the drive array overnight a simple bad
sector on the array will take down the complete system
instead of just continuing with 1 faulty drive in the
array!

Some information about the hardware:
AMD Athlon 64 3000+
Asus A8N-E Deluxe motherboard
1 GB RAM
4 Seagate 7200.9 drives on the NVIDIA SATA controller
(sda ... sdd)
2 WD drives on the IDE controller (hda, hdc)
Running Feisty Fawn 64 bit Server edition

Faulty drive is /dev/sda and on thus on the first SATA
port. Changing this to a different port on the
motherboard gives the same lockup. There is also a SIL
3114 controller on the motherboard but I have not
tried to dd_rescue with the faulty drive on that
controller to see if it locks up the kernel.

Regards,
Hendrik van den Boogaard




      ____________________________________________________________________________________
Luggage? GPS? Comic books? 
Check out fitting gifts for grads at Yahoo! Search
http://search.yahoo.com/search?fr=oni_on_mail&p=graduation+gifts&cs=bz
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ