Message-ID:  <472383EA.20506@tmr.com>
Date:	Sat, 27 Oct 2007 14:31:06 -0400
From:	Bill Davidsen <davidsen@....com>
To:	linux-kernel@...r.kernel.org
Cc:	linux-kernel@...r.kernel.org
Subject:  Re: RAID 10 w AHCI w NCQ = Spurious I/O error

Nestor A. Diaz wrote:
> Hello People,
> 
> I need your help, this problem is driving me crazy.

Did you know there is a raid list?
> 
> I have created a RAID 10 using a RAID0 configuration on top of two 
> RAID1 devices (all software raid), like this:

What you have created is a layered array (raid0 striped over raid1 
mirrors); md's native raid10 personality is a different thing. Given 
your setup, raid10 is probably what you *should* have created.
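
For what it's worth, the whole thing can be built in one step with md's
native raid10 personality; roughly like this (an untested sketch, the
partition names are just taken from your /proc/mdstat below, and it
assumes you stop and recreate the arrays, which destroys the data on
them):

   # stop the old stacked arrays first (data on them is lost)
   mdadm --stop /dev/md4 /dev/md2 /dev/md3
   # one native raid10 across the four data partitions, 64k chunk,
   # two copies of every block (the default n2 layout)
   mdadm --create /dev/md4 --level=10 --raid-devices=4 \
         --chunk=64 --layout=n2 \
         /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2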
> 
> Personalities : [raid0] [raid1]
> md4 : active raid0 md2[0] md3[1]
>      605071872 blocks 64k chunks
> 
> md0 : active raid1 sdd3[3] sda3[0] sdc3[2] sdb3[1]
>      9791552 blocks [4/4] [UUUU]
> 
> md3 : active raid1 sdd2[2](F) sdb2[0]
>      302536000 blocks [2/1] [U_]
> 
> md1 : active raid1 sdd1[3] sda1[0] sdc1[2] sdb1[1]
>      240832 blocks [4/4] [UUUU]
> 
> md2 : active raid1 sda2[0] sdc2[1]
>      302536000 blocks [2/2] [UU]
> 
> unused devices: <none>
> 
> But the sdd device sometimes fails. I have changed the hard disk, 
> checked the older sata drive, and reformatted using mke2fs -c -c (to 
> check for media errors, both read and write; no media problems were 
> found). I changed the sata disk and the problem remains, even with a 
> new sata hard disk.
> 
> The system is a supermicro server 5015-mt+ with an ich7 ahci controller

[___snip___]
> 
> The RAID 1 builds perfectly, but five days after that, the system shows a:
> 
> end_request: I/O error, dev sdd, sector 144006110
> raid1: Disk failure on sdd2, disabling device.
> Operation continuing on 1 devices
> end_request: I/O error, dev sdd, sector 144006222
> end_request: I/O error, dev sdd, sector 144268814
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:0, o:1, dev:sdb2
> disk 1, wo:1, o:0, dev:sdd2
> RAID1 conf printout:
> --- wd:1 rd:2
> disk 0, wo:0, o:1, dev:sdb2

Hardware error, almost certainly. If you're using a hub, I suspect that 
first, then cables and heat problems, then the controller, in rough 
order of likelihood.
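
As a first check, assuming smartmontools is on the box, I would pull the
drive's own view of things and compare it with what libata is reporting:

   # SMART health, attributes and the drive's internal error log
   # (older smartctl may need -d ata for SATA disks behind libata)
   smartctl -a /dev/sdd
   # just the ATA error log; CRC/link errors tend to point at cabling,
   # real media errors at the disk itself
   smartctl -l error /dev/sdd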
> 
> A week before that I got (under 2.6.18) the following message:
> 
[___lots more snip___]

> 
> I have updated from 2.6.18 to 2.6.22 expecting the problem to go 
> away, but it remains and I don't know what the cause could be. The 
> failures always happen on /dev/sdd. I use LVM on top of the RAID 10 
> software device.
> 
> I am not sure if the problem is because I created the RAID 10 by 
> building a RAID0 on top of two RAID1 devices, or whether I should 
> have used mdadm with the level 10 option?
> 
> Any suggestions will be welcome.
> 
Do you ever get errors in partitions which are not part of the layered 
raid setup, like md1? If not, look at your partition tables to see if 
you have any strange values there.
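
For example, something as simple as this lets you compare the sdd table
against the drives that never fail:

   # print all four partition tables; look for overlapping or oddly
   # sized entries on sdd compared with sda/sdb/sdc
   fdisk -l /dev/sda /dev/sdb /dev/sdc /dev/sdd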

Are all drives at the same firmware level?
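
Easy enough to check with something like this (assuming hdparm is
installed; smartctl -i shows much the same information):

   # model number and firmware revision for each drive
   for d in sda sdb sdc sdd; do
       hdparm -I /dev/$d | grep -iE 'model|firmware'
   done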

-- 
Bill Davidsen <davidsen@....com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

