linux-kernel - Re: sata_mv port lockup on hotplug (kernel 2.6.38.2)


Message-ID: <20110906034544.GB18425@mtj.dyndns.org>
Date:	Tue, 6 Sep 2011 12:45:44 +0900
From:	Tejun Heo <htejun@...il.com>
To:	Bruce Stenning <b.stenning@...igovision.com>
Cc:	Mark Lord <kernel@...savvy.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-ide@...r.kernel.org" <linux-ide@...r.kernel.org>,
	Jeff Garzik <jgarzik@...ox.com>
Subject: Re: sata_mv port lockup on hotplug (kernel 2.6.38.2)

Hello,

On Fri, Sep 02, 2011 at 05:22:38PM +0100, Bruce Stenning wrote:
> Unfortunately it has so far been quite difficult to reproduce when specifically
> attempting to.  In normal use cases I reproduced it twice by unplugging a drive
> from a RAID array with redundancy intact.  This was out of around a dozen
> cycles of waiting until redundancy was restored while the unit was under load,
> popping the disk, reinserting, and triggering a RAID rebuild.

Hmm... that's unfortunate.

> I have only twice managed to trigger a lockup deliberately.  In both cases the
> tracing showed a scheduled EH which was subsequently not enacted.
> 
> How long could it take for the EH to be enacted?  In the lockups that I
> have reproduced it did not seem to have recovered minutes later, but perhaps
> if I had waited longer...?  I have noticed that error recovery sometimes backs
> off for 8s and even 33s, but it always warns when that sort of delay is coming
> up.

It should happen pretty quickly.  In such cases, fastdrain is
activated and all pending commands are aborted if they don't complete
within 3 secs, and then EH should kick in.  The backoff comes from the
reset path only, to give breathing time to devices which take a long
time to spin up, and doesn't apply in this case.
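
As a rough illustration of the fastdrain behaviour described above, here is a
small user-space sketch, not the libata code itself.  The struct and function
names (port_model, fastdrain_tick) are hypothetical, and the rule that "no
change in the in-flight count across one 3 sec interval means no progress" is
an assumption of the model:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for per-port state; NOT the kernel's struct ata_port. */
struct port_model {
	int in_flight;      /* commands still outstanding on the port */
	int fastdrain_cnt;  /* in-flight count recorded at the previous tick */
	bool frozen;        /* set once the remaining commands are given up on */
};

/*
 * Model of one timer tick, assumed to fire every 3 seconds while EH is
 * waiting for the port to drain.  Returns true when EH may proceed.
 */
bool fastdrain_tick(struct port_model *p)
{
	assert(p != NULL);

	if (p->in_flight == 0)
		return true;            /* everything completed on its own */

	if (p->in_flight == p->fastdrain_cnt) {
		/* No command completed during the last interval: abort them all. */
		printf("fastdrain: no progress, aborting %d command(s)\n",
		       p->in_flight);
		p->in_flight = 0;
		p->frozen = true;
		return true;
	}

	/* Some progress was made: remember the new count and wait again. */
	p->fastdrain_cnt = p->in_flight;
	return false;
}
```

In this model EH is delayed by at most one interval after commands stop
completing, which is why a lockup lasting minutes points elsewhere.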

> I shall continue to try to track down why the scheduled EH does not happen.

Can you please add some debug printk's to scsi_schedule_eh() and see
whether scsi_eh_wakeup() is invoked from there?  It seems likely that
the problem is caused by race conditions around
SHOST_[CANCEL_]RECOVERY flags.
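
To make the suggested tracing concrete, here is a user-space sketch of where
the debug output could go.  It is a simplified model, not the kernel source:
the model_* names and struct host_model are hypothetical, locking is omitted,
fprintf() stands in for printk(), and the allowed state transitions are
written from memory and should be checked against scsi_host_set_state() in
the tree being debugged:

```c
#include <assert.h>
#include <stdio.h>

enum host_state { HOST_RUNNING, HOST_CANCEL, HOST_RECOVERY, HOST_CANCEL_RECOVERY };

/* Hypothetical stand-in for the host; NOT the kernel's struct Scsi_Host. */
struct host_model {
	enum host_state state;
	int host_busy;          /* commands currently owned by the low-level driver */
	int host_failed;        /* commands that have failed and await EH */
	int host_eh_scheduled;  /* EH requests not tied to a failed command */
	int wakeups;            /* how many times the EH thread was woken */
};

/* Assumed, simplified transition rules; returns 0 on success, -1 if refused. */
int model_set_state(struct host_model *h, enum host_state next)
{
	int ok = 0;

	if (next == HOST_RECOVERY)
		ok = (h->state == HOST_RUNNING);
	else if (next == HOST_CANCEL_RECOVERY)
		ok = (h->state == HOST_CANCEL || h->state == HOST_RECOVERY);

	if (!ok)
		return -1;
	h->state = next;
	return 0;
}

/* Wake the EH thread only once every busy command has failed. */
void model_eh_wakeup(struct host_model *h)
{
	fprintf(stderr, "eh_wakeup: busy=%d failed=%d\n",
		h->host_busy, h->host_failed);
	if (h->host_busy == h->host_failed) {
		h->wakeups++;
		fprintf(stderr, "eh_wakeup: waking EH thread\n");
	} else {
		fprintf(stderr, "eh_wakeup: NOT waking, commands still in flight\n");
	}
}

void model_schedule_eh(struct host_model *h)
{
	assert(h != NULL);
	fprintf(stderr, "schedule_eh: enter, state=%d\n", (int)h->state);

	if (model_set_state(h, HOST_RECOVERY) == 0 ||
	    model_set_state(h, HOST_CANCEL_RECOVERY) == 0) {
		h->host_eh_scheduled++;
		model_eh_wakeup(h);
	} else {
		/* The interesting case: the request is silently dropped. */
		fprintf(stderr, "schedule_eh: state change refused, EH not scheduled\n");
	}
}
```

The two messages worth watching for are the "refused" branch and the
"NOT waking" branch; either one would leave an EH request pending with
nothing to act on it.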

Thanks.

-- 
tejun