lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1180744988.6830.183.camel@upup>
Date:	Fri, 01 Jun 2007 20:43:08 -0400
From:	Brad Langhorst <brad@...ghorst.com>
To:	linux-kernel@...r.kernel.org
Subject: first one disk drops our of raid, then the other - I/O error on
	same sector. raid bug?

This is a debian stable system in production. Kernel 2.6.18.
I upgraded the kernel from 2.6.8 to 2.6.18 about 2 weeks ago and saw no
problems until today.


May 30 12:58:00 bitc kernel: (scsi0:A:1:0): Unexpected busfree in
Command phase
May 30 12:58:00 bitc kernel: SEQADDR == 0x16c
May 30 12:58:00 bitc kernel: sd 0:0:1:0: SCSI error: return code =
0x00010000
May 30 12:58:00 bitc kernel: end_request: I/O error, dev sdb, sector
287322317
May 30 12:58:00 bitc kernel: raid1: Disk failure on sdb2, disabling
device. 
May 30 12:58:00 bitc kernel: ^IOperation continuing on 1 devices
May 30 12:58:01 bitc kernel: sd 0:0:1:0: SCSI error: return code =
0x00010000
May 30 12:58:01 bitc kernel: end_request: I/O error, dev sdb, sector
284758725
May 30 12:58:01 bitc kernel: RAID1 conf printout:
May 30 12:58:01 bitc kernel:  --- wd:1 rd:2
May 30 12:58:01 bitc kernel:  disk 0, wo:0, o:1, dev:sda2
May 30 12:58:01 bitc kernel:  disk 1, wo:1, o:0, dev:sdb2
May 30 12:58:01 bitc kernel: RAID1 conf printout:
May 30 12:58:01 bitc kernel:  --- wd:1 rd:2
May 30 12:58:01 bitc kernel:  disk 0, wo:0, o:1, dev:sda2

I can add /dev/sdb back

May 30 17:14:17 bitc kernel: md: md1: sync done.
May 30 17:14:17 bitc kernel: RAID1 conf printout:
May 30 17:14:17 bitc kernel:  --- wd:2 rd:2
May 30 17:14:17 bitc kernel:  disk 0, wo:0, o:1, dev:sda2
May 30 17:14:17 bitc kernel:  disk 1, wo:0, o:1, dev:sdb2

thenk /dev/sda fell out!

May 31 02:37:06 bitc kernel: (scsi0:A:0:0): Unexpected busfree in
Command phase
May 31 02:37:06 bitc kernel: SEQADDR == 0x16c
May 31 02:37:06 bitc kernel: sd 0:0:0:0: SCSI error: return code =
0x00010000
May 31 02:37:06 bitc kernel: end_request: I/O error, dev sda, sector
287322317
May 31 02:37:06 bitc kernel: raid1: Disk failure on sda2, disabling
device. 
May 31 02:37:06 bitc kernel: ^IOperation continuing on 1 devices
May 31 02:37:06 bitc kernel: sd 0:0:0:0: SCSI error: return code =
0x00010000
May 31 02:37:06 bitc kernel: end_request: I/O error, dev sda, sector
276325445
May 31 02:37:06 bitc kernel: RAID1 conf printout:
May 31 02:37:06 bitc kernel:  --- wd:1 rd:2
May 31 02:37:06 bitc kernel:  disk 0, wo:1, o:0, dev:sda2
May 31 02:37:06 bitc kernel:  disk 1, wo:0, o:1, dev:sdb2
May 31 02:37:06 bitc kernel: RAID1 conf printout:
May 31 02:37:06 bitc kernel:  --- wd:1 rd:2
May 31 02:37:06 bitc kernel:  disk 1, wo:0, o:1, dev:sdb2

Note that the first error is at the same sector for both disks
287322317(sda) = 287322317(sdb)
but the second one is at a different spot 
276325445(sda) != 284758725(sbd)

it seems unlikely that both disks have exactly the same bad sector 

Maybe a disk is failing, but smart doesn't show a big difference in ECC
errors between the two disks.  Seems wrong that the kernel raid should
first drop one disk, then the other if one disk is really failing.

Is this a bug?

Brad



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ