linux-kernel - Re: Scsi errors with Megaraid 300-8x


Message-ID: <44EC73D2.9090302@rtr.ca>
Date:	Wed, 23 Aug 2006 11:27:14 -0400
From:	Mark Lord <lkml@....ca>
To:	Johan Groth <johan.groth@...ux-grotto.org.uk>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Scsi errors with Megaraid 300-8x

Johan Groth wrote:
> Hi,
> ever since I upgraded my server from a dual Opteron 244 (mobo Tyan 2885)
> system to a dual dual-core Opteron 285 (mobo Tyan 2895) system, I've been
> getting read errors that freeze the system, which in turn causes my
> disk-based backup software (faubackup) to stop working. I think it is
> faubackup that triggers the bug.
> 
> I get these errors in the log:
> Aug 20 06:35:08 jaguar kernel: sd 2:1:0:0: SCSI error: return code = 0x40001
> Aug 20 06:35:56 jaguar kernel: end_request: I/O error, dev sda, sector 616924530
> Aug 20 06:36:03 jaguar kernel: sd 2:1:0:0: SCSI error: return code = 0x40001
> Aug 20 06:36:03 jaguar kernel: end_request: I/O error, dev sda, sector 616924538
..
> Aug 20 06:36:07 jaguar kernel: sd 2:1:0:0: SCSI error: return code = 0x40001
> Aug 20 06:36:07 jaguar kernel: end_request: I/O error, dev sda, sector 616924538
> 
> The error for the last sector repeats until I reboot the machine. The only
> change I've made to the raid configuration is that sdc is now 2x250 MB
> instead of 4x120 MB, but that array is the target, not the source (sda).
> The raid HW is an LSI Megaraid 300-8x with the following configuration:
..

That looks like the classic SCSI bad-sector non-recovery bug.
The code in scsi_lib.c, scsi_error.c, and sd.c is currently a
bit of a mess here.

Basically, given an I/O request for 200 sectors, with a bad sector
in the middle at number 100, what SCSI will often do is fail sectors
number 1 through 100, one at a time, retrying the entire remainder of
the request after each attempt.  This takes hours, and results in no
data for the first 99 good sectors.
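
To make the arithmetic concrete, here is a toy Python model of that
behaviour.  It is only a sketch, not kernel code: the function name and
the "set of bad sectors" model are made up for illustration, and each
"attempt" stands in for one submission to the device (which in practice
may include the drive's own lengthy internal retries).

```python
def whole_request_retry(first, last, bad):
    """Toy model: on error, fail only the leading sector and resubmit
    the entire remainder of the request (hypothetical helper)."""
    failed, ok, attempts = [], [], 0
    start = first
    while start <= last:
        attempts += 1
        # The submission fails if any bad sector lies in start..last.
        if any(start <= b <= last for b in bad):
            failed.append(start)   # give up on the leading sector only
            start += 1             # ...and retry everything after it
        else:
            ok.extend(range(start, last + 1))
            break
    return failed, ok, attempts


failed, ok, attempts = whole_request_retry(1, 200, {100})
print(len(failed), len(ok), attempts)   # 100 100 101
```

In this model sectors 1..100 are all reported as failed, even though
only sector 100 is bad, and it takes 100 failing submissions to get
there.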

What it needs to do *instead* is retry each sector individually,
rather than the entire request.  This would result in sectors 1..99
and 101..200 succeeding, and retries/failure only for sector 100.
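
As a rough illustration of the per-sector approach, here is a toy Python
model.  It is not taken from my patches: the function name, the retry
limit of 3, and the "set of bad sectors" model are all assumptions made
up for the example.

```python
def per_sector_retry(first, last, bad, max_retries=3):
    """Toy model: if the whole request fails, fall back to reading each
    sector on its own, retrying only the ones that error out."""
    attempts = 1  # first, try the request as a whole
    if not any(first <= b <= last for b in bad):
        return list(range(first, last + 1)), [], attempts

    ok, failed = [], []
    for sector in range(first, last + 1):
        for _ in range(max_retries):
            attempts += 1
            if sector not in bad:
                ok.append(sector)
                break
        else:
            # Every retry failed: report just this one sector as bad.
            failed.append(sector)
    return ok, failed, attempts


ok, failed, attempts = per_sector_retry(1, 200, {100})
print(len(ok), failed, attempts)   # 199 [100] 203
```

The caller gets the data for all 199 good sectors, and the retries are
spent only on sector 100.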

A slight optimization would be to fail a bio-sized chunk around sector 100,
rather than just the one sector.
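
For example, a sketch of how the chunk to fail could be computed.  This
assumes, purely for illustration, that the request is made up of
equal-sized bios of 8 sectors each, laid out from the start of the
request; real requests need not look like that, and the helper name is
made up.

```python
def bio_range(first, last, sector, bio_sectors=8):
    """Return the (start, end) sector range of the bio-sized chunk that
    contains `sector`, for a request covering first..last.
    Assumes fixed-size chunks counted from `first` (an illustration)."""
    index = (sector - first) // bio_sectors
    start = first + index * bio_sectors
    end = min(start + bio_sectors - 1, last)
    return start, end


print(bio_range(1, 200, 100))   # (97, 104)
```

Under that assumption, sectors 97..104 would be failed together and the
other 192 sectors of the request would still be returned.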

I've got patches that do exactly this, and they work quite well.
But they're probably not "pretty enough" for inclusion.

Cheers
