linux-kernel - Re: [SCSI][REGRESSION][BISECTED] Disk errors loop forever in 2.6.29

Message-ID: <20090219145844.GA22220@elte.hu>
Date:	Thu, 19 Feb 2009 15:58:44 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Sitsofe Wheeler <sitsofe@...oo.com>
Cc:	linux-kernel@...r.kernel.org,
	Alan Stern <stern@...land.harvard.edu>,
	James Bottomley <James.Bottomley@...senPartnership.com>
Subject: Re: [SCSI][REGRESSION][BISECTED] Disk errors loop forever in 2.6.29


* Sitsofe Wheeler <sitsofe@...oo.com> wrote:

> Hi,
> 
> There appears to be a regression from 2.6.28 in how disk errors are
> handled in 2.6.29-rc5: rather than retrying and eventually giving up, it
> retries (and reports the error) forever.
> 
> Here is the output where it aborts in 2.6.28:
> 
> ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.01: BMDMA stat 0x65
> ata2.01: cmd c8/00:00:f7:e2:9c/00:00:00:00:00/f1 tag 0 dma 131072 in
>          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata2: soft resetting link
> ata2.00: configured for UDMA/66
> ata2.01: configured for UDMA/66
> ata2: EH complete
> ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.01: BMDMA stat 0x65
> ata2.01: cmd c8/00:00:f7:e2:9c/00:00:00:00:00/f1 tag 0 dma 131072 in
>          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata2: soft resetting link
> ata2.00: configured for UDMA/66
> ata2.01: configured for UDMA/66
> ata2: EH complete
> sd 1:0:0:0: [sda] 7880544 512-byte hardware sectors: (4.03 GB/3.75 GiB)
> ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.01: BMDMA stat 0x65
> ata2.01: cmd c8/00:00:f7:e2:9c/00:00:00:00:00/f1 tag 0 dma 131072 in
>          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata2: soft resetting link
> ata2.00: configured for UDMA/66
> ata2.01: configured for UDMA/66
> ata2: EH complete
> sd 1:0:0:0: [sda] Write Protect is off
> sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
> ata2.01: limiting speed to UDMA/44:PIO4
> ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.01: BMDMA stat 0x65
> ata2.01: cmd c8/00:00:f7:e2:9c/00:00:00:00:00/f1 tag 0 dma 131072 in
>          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata2: soft resetting link
> ata2.00: configured for UDMA/66
> ata2.01: configured for UDMA/44
> ata2: EH complete
> ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.01: BMDMA stat 0x65
> ata2.01: cmd c8/00:00:f7:e2:9c/00:00:00:00:00/f1 tag 0 dma 131072 in
>          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata2: soft resetting link
> ata2.00: configured for UDMA/66
> ata2.01: configured for UDMA/44
> ata2: EH complete
> sd 1:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> sda: detected capacity change from 0 to 4034838528
> ata2.01: limiting speed to UDMA/33:PIO4
> ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.01: BMDMA stat 0x65
> ata2.01: cmd c8/00:00:f7:e2:9c/00:00:00:00:00/f1 tag 0 dma 131072 in
>          res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata2: soft resetting link
> ata2.00: configured for UDMA/66
> ata2.01: configured for UDMA/33
> sd 1:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08
> sd 1:0:1:0: [sdb] Sense Key : 0xb [current] [descriptor]
> Descriptor sense data with sense descriptors (in hex):
>         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
>         00 00 00 00 
> sd 1:0:1:0: [sdb] ASC=0x0 ASCQ=0x0
> end_request: I/O error, dev sdb, sector 27058935
> Buffer I/O error on device sdb2, logical block 444480
> Buffer I/O error on device sdb2, logical block 444481
> Buffer I/O error on device sdb2, logical block 444482
> Buffer I/O error on device sdb2, logical block 444483
> Buffer I/O error on device sdb2, logical block 444484
> Buffer I/O error on device sdb2, logical block 444485
> Buffer I/O error on device sdb2, logical block 444486
> Buffer I/O error on device sdb2, logical block 444487
> Buffer I/O error on device sdb2, logical block 444488
> Buffer I/O error on device sdb2, logical block 444489
> ata2: EH complete
> 
> It never gets to end_request on 2.6.29. I've bisected the problem down
> to the following:
> 
> [b60af5b0adf0da24c673598c8d3fb4d4189a15ce] [SCSI] simplify scsi_io_completion()
> 
> Author: Alan Stern <stern@...land.harvard.edu>
> Date:   Mon Nov 3 15:56:47 2008 -0500
> 
>     [SCSI] simplify scsi_io_completion()

I had SCSI problems in that area of the code, and the patch
below fixed them. Maybe it fixes your problem too.

	Ingo

----------------------->
From 7e4cbd14b28546e6f758d01b0d7f3f4647fada52 Mon Sep 17 00:00:00 2001
From: Alan Stern <stern@...land.harvard.edu>
Date: Tue, 17 Feb 2009 15:10:21 -0500
Subject: [PATCH] fix "scsi: aic7xxx hang since v2.6.28-rc1"

This patch (as1144c) removes scsi_end_request().  The routine had only
one caller, so it is moved inline and simplified.

In addition, if no forward progress has been made then the patch
decrements a request's retry counter before unpreparing and requeuing
it, to avoid infinite retry loops.

Signed-off-by: Alan Stern <stern@...land.harvard.edu>
Cc: James Bottomley <James.Bottomley@...senPartnership.com>
Cc: Jens Axboe <jens.axboe@...cle.com>
Cc: "Rafael J. Wysocki" <rjw@...k.pl>
Signed-off-by: Ingo Molnar <mingo@...e.hu>
---
 drivers/scsi/scsi_lib.c |   94 ++++++++++-------------------------------------
 1 files changed, 20 insertions(+), 74 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 940dc32..d4c6ac3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -703,71 +703,6 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
 
 static void __scsi_release_buffers(struct scsi_cmnd *, int);
 
-/*
- * Function:    scsi_end_request()
- *
- * Purpose:     Post-processing of completed commands (usually invoked at end
- *		of upper level post-processing and scsi_io_completion).
- *
- * Arguments:   cmd	 - command that is complete.
- *              error    - 0 if I/O indicates success, < 0 for I/O error.
- *              bytes    - number of bytes of completed I/O
- *		requeue  - indicates whether we should requeue leftovers.
- *
- * Lock status: Assumed that lock is not held upon entry.
- *
- * Returns:     cmd if requeue required, NULL otherwise.
- *
- * Notes:       This is called for block device requests in order to
- *              mark some number of sectors as complete.
- * 
- *		We are guaranteeing that the request queue will be goosed
- *		at some point during this call.
- * Notes:	If cmd was requeued, upon return it will be a stale pointer.
- */
-static struct scsi_cmnd *scsi_end_request(struct scsi_cmnd *cmd, int error,
-					  int bytes, int requeue)
-{
-	struct request_queue *q = cmd->device->request_queue;
-	struct request *req = cmd->request;
-
-	/*
-	 * If there are blocks left over at the end, set up the command
-	 * to queue the remainder of them.
-	 */
-	if (blk_end_request(req, error, bytes)) {
-		int leftover = (req->hard_nr_sectors << 9);
-
-		if (blk_pc_request(req))
-			leftover = req->data_len;
-
-		/* kill remainder if no retrys */
-		if (error && scsi_noretry_cmd(cmd))
-			blk_end_request(req, error, leftover);
-		else {
-			if (requeue) {
-				/*
-				 * Bleah.  Leftovers again.  Stick the
-				 * leftovers in the front of the
-				 * queue, and goose the queue again.
-				 */
-				scsi_release_buffers(cmd);
-				scsi_requeue_command(q, cmd);
-				cmd = NULL;
-			}
-			return cmd;
-		}
-	}
-
-	/*
-	 * This will goose the queue request function at the end, so we don't
-	 * need to worry about launching another command.
-	 */
-	__scsi_release_buffers(cmd, 0);
-	scsi_next_command(cmd);
-	return NULL;
-}
-
 static inline unsigned int scsi_sgtable_index(unsigned short nents)
 {
 	unsigned int index;
@@ -929,7 +864,6 @@ static void scsi_end_bidi_request(struct scsi_cmnd *cmd)
 void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 {
 	int result = cmd->result;
-	int this_count;
 	struct request_queue *q = cmd->device->request_queue;
 	struct request *req = cmd->request;
 	int error = 0;
@@ -980,18 +914,30 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	SCSI_LOG_HLCOMPLETE(1, printk("%ld sectors total, "
 				      "%d bytes done.\n",
 				      req->nr_sectors, good_bytes));
-
-	/* A number of bytes were successfully read.  If there
-	 * are leftovers and there is some kind of error
-	 * (result != 0), retry the rest.
-	 */
-	if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL)
+	if (blk_end_request(req, error, good_bytes) == 0) {
+		/* This request is completely finished; start the next one */
+		scsi_next_command(cmd);
 		return;
-	this_count = blk_rq_bytes(req);
+	}
 
 	error = -EIO;
 
-	if (host_byte(result) == DID_RESET) {
+	/* The request isn't finished yet.  Figure out what to do next. */
+	if (result == 0) {
+		/* No error, so carry out the remainder of the request.
+		 * Failure to make forward progress counts against the
+		 * number of retries.
+		 */
+		if (good_bytes > 0 || --req->retries >= 0)
+			action = ACTION_REPREP;
+		else {
+			action = ACTION_FAIL;
+			description = "Retries exhausted";
+		}
+	} else if (error && scsi_noretry_cmd(cmd)) {
+		/* Retrys are disallowed, so kill the remainder. */
+		action = ACTION_FAIL;
+	} else if (host_byte(result) == DID_RESET) {
 		/* Third party bus reset or reset for error recovery
 		 * reasons.  Just retry the command and see what
 		 * happens.
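
The retry rule described in the commit message can be illustrated outside the
kernel. The following is a minimal user-space sketch, not kernel code. The
names struct toy_request, decide(), attempts_until_fail() and ACTION_DONE are
hypothetical, and only the "result == 0" branch of the patched
scsi_io_completion() is modelled. It assumes a device that never transfers any
bytes, so every attempt makes no forward progress.

```c
/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
enum action { ACTION_DONE, ACTION_REPREP, ACTION_FAIL };

struct toy_request {
	unsigned int bytes_left;	/* bytes still to be transferred */
	int retries;			/* remaining retry budget */
};

/*
 * Decide what to do after one attempt that completed good_bytes bytes.
 * Mirrors the condition added by the patch: an attempt that made progress
 * is re-prepared for free, an attempt without progress costs one retry.
 */
static enum action decide(struct toy_request *req, unsigned int good_bytes)
{
	if (good_bytes > req->bytes_left)
		good_bytes = req->bytes_left;
	req->bytes_left -= good_bytes;

	if (req->bytes_left == 0)
		return ACTION_DONE;	/* request completely finished */

	if (good_bytes > 0 || --req->retries >= 0)
		return ACTION_REPREP;	/* carry out the remainder */

	return ACTION_FAIL;		/* retries exhausted */
}

/* Count the attempts made against a device that never transfers anything. */
static int attempts_until_fail(int retries)
{
	struct toy_request req = { .bytes_left = 131072, .retries = retries };
	int attempts = 1;

	while (decide(&req, 0) == ACTION_REPREP)
		attempts++;
	return attempts;
}
```

With a starting budget of 5 retries, attempts_until_fail(5) returns 6: the
first attempt plus five retries. Without the decrement, the same loop would
never terminate, which matches the endless retries reported for 2.6.29-rc5.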