linux-kernel - Re: [RFC] training mpath to discern between SCSI errors

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4CBD18A0.5070206@ce.jp.nec.com>
Date:	Tue, 19 Oct 2010 13:03:44 +0900
From:	"Jun'ichi Nomura" <j-nomura@...jp.nec.com>
To:	Hannes Reinecke <hare@...e.de>
CC:	device-mapper development <dm-devel@...hat.com>,
	Kiyoshi Ueda <k-ueda@...jp.nec.com>, michaelc@...wisc.edu,
	tytso@....edu, linux-scsi@...r.kernel.org,
	Mike Snitzer <snitzer@...hat.com>, jaxboe@...ionio.com,
	jack@...e.cz, vst@...b.net, linux-kernel@...r.kernel.org,
	swhiteho@...hat.com, linux-raid@...r.kernel.org,
	linux-ide@...r.kernel.org, James.Bottomley@...e.de,
	chris.mason@...cle.com, konishi.ryusuke@....ntt.co.jp,
	linux-fsdevel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
	rwheeler@...hat.com, Christoph Hellwig <hch@....de>,
	Sergei Shtylyov <sshtylyov@...sta.com>
Subject: Re: [RFC] training mpath to discern between SCSI errors

Hi Hannes,

(10/18/10 20:55), Hannes Reinecke wrote:
>> Does 'retryable' of EIO mean retryable in multipath layer?
>> If so, what is the difference between EIO and ENOLINK?
>>
> Yes, EIO is intended for errors which should be retried at the
> multipath layer. This does _not_ include transport errors, which are
> signalled by ENOLINK.
> 
> Basically, ENOLINK is a transport error, and EIO just means
> something is wrong and we weren't able to classify it properly.
> If we were, it'd be either ENOLINK or EREMOTEIO.
> 
>> I've heard of a case where just retrying within path-group is
>> preferred to (relatively costly) switching group.
>> So, if EIO (or other error code) can be used to indicate such type
>> of errors, it's nice.
>>
> Yes, that was one of the intention.

Great to hear that.

And when it comes to retrying, the next problem is who controls it.
I don't think it's good to duplicate retry logic in multipath and
underlying device like SCSI (i.e. sd retries 5 times).
So perhaps we need a way to disable (or limit) retries in underlying
device at least.

>> Also (although this might be a bit off topic from your patch),
>> can we expand such a distinction to what should be logged?
>> Currently, it's difficult to distinguish important SCSI/block errors
>> and less important ones in kernel log.
>> For example, when I get a link failure on sda, kernel prints something
>> like below, regardless of whether the I/O is recovered by multipathing or not:
>>   end_request: I/O error, dev sda, sector XXXXX
>>
> Indeed, when using the above we could be modifying the above
> message, eg by
> 
> end_request: transport error, dev sda, sector XXXXX
> 
> or
> 
> end_request: target error, dev sda, sector XXXXX
> 
> which would improve the output noticeable.

It improves but still they look like critical errors
even if multipath saves them.

When I see this:

  end_request: target error, dev sda, sector XXXXX

I can't tell whether it's a real error visible to user space
or it's just recoverred by multipath retry/failover afterwards.


>> Setting REQ_QUIET in dm-multipath could mask the message
>> but also other important ones in SCSI.
>>
> Hmm. Not sure about that, but I think the above modifications will
> be useful already.
> 
> I'll be sending an updated patch.

Thank you. I'm looking for that.

-- 
Jun'ichi Nomura, NEC Corporation
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/