Date:	Tue, 6 Sep 2011 12:45:44 +0900
From:	Tejun Heo <htejun@...il.com>
To:	Bruce Stenning <b.stenning@...igovision.com>
Cc:	Mark Lord <kernel@...savvy.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-ide@...r.kernel.org" <linux-ide@...r.kernel.org>,
	Jeff Garzik <jgarzik@...ox.com>
Subject: Re: sata_mv port lockup on hotplug (kernel 2.6.38.2)

Hello,

On Fri, Sep 02, 2011 at 05:22:38PM +0100, Bruce Stenning wrote:
> Unfortunately it has so far been quite difficult to reproduce when specifically
> attempting to.  In normal use cases I reproduced it twice by unplugging a drive
> from a RAID array with redundancy intact.  This was out of around a dozen
> cycles of waiting until redundancy was restored while the unit was under load,
> popping the disk, reinserting, and triggering a RAID rebuild.

Hmm... that's unfortunate.

> I have only twice managed to trigger a lockup deliberately.  In both cases the
> tracing showed a scheduled EH which was subsequently not enacted.
> 
> How long could it take for the EH to be enacted?  In the lockups that I
> have reproduced it did not seem to have recovered minutes later, but perhaps
> if I had waited longer...?  I have noticed that error recovery sometimes backs
> off for 8s and even 33s, but it always warns when that sort of delay is coming
> up.

It should happen pretty quickly.  In such cases, fastdrain is
activated and all pending commands are aborted if they don't complete
within 3 secs, and then EH should kick in.  The backoff comes from the
reset path only, to give breathing time to devices which take a long
time to spin up, and it doesn't apply in this case.

> I shall continue to try to track down why the scheduled EH does not happen.

Can you please add some debug printk's to scsi_schedule_eh() and see
whether scsi_eh_wakeup() is invoked from there?  It seems likely that
the problem is caused by race conditions around
SHOST_[CANCEL_]RECOVERY flags.
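
To illustrate the kind of tracing I mean, here is a rough userspace
sketch, not kernel code.  Every toy_* name is made up for this example,
fprintf() stands in for printk(), and the state-transition rule is a
simplified assumption rather than the real scsi_host_set_state() table.
The point is only to log on entry, on the wakeup call, and on the path
where the state change is refused and the wakeup is silently skipped.

```c
#include <stdio.h>

/* Toy stand-ins for the host states named above; hypothetical names. */
enum toy_state { TOY_RUNNING, TOY_RECOVERY, TOY_CANCEL, TOY_CANCEL_RECOVERY };

struct toy_host {
	enum toy_state state;
	int eh_scheduled;	/* EH requests recorded */
	int wakeups;		/* times the toy wakeup ran */
};

/*
 * Assumed, simplified transition rule (NOT the kernel's table):
 * RUNNING -> RECOVERY and CANCEL -> CANCEL_RECOVERY only.
 * Returns 0 on success, -1 if the transition is refused.
 */
static int toy_set_state(struct toy_host *h, enum toy_state next)
{
	if (next == TOY_RECOVERY && h->state == TOY_RUNNING) {
		h->state = next;
		return 0;
	}
	if (next == TOY_CANCEL_RECOVERY && h->state == TOY_CANCEL) {
		h->state = next;
		return 0;
	}
	return -1;
}

/* Stand-in for the wakeup call; logs so the trace shows it was reached. */
static void toy_eh_wakeup(struct toy_host *h)
{
	h->wakeups++;
	fprintf(stderr, "toy_eh_wakeup: state=%d wakeups=%d\n",
		h->state, h->wakeups);
}

/* Returns 1 if the wakeup was invoked, 0 if it was skipped. */
static int toy_schedule_eh(struct toy_host *h)
{
	fprintf(stderr, "toy_schedule_eh: enter, state=%d\n", h->state);

	if (toy_set_state(h, TOY_RECOVERY) == 0 ||
	    toy_set_state(h, TOY_CANCEL_RECOVERY) == 0) {
		h->eh_scheduled++;
		toy_eh_wakeup(h);
		return 1;
	}

	/* This is the interesting case: a request that wakes nobody. */
	fprintf(stderr, "toy_schedule_eh: state change refused, "
		"wakeup skipped, state=%d\n", h->state);
	return 0;
}
```

In this toy model a second toy_schedule_eh() call made while the host
is still in TOY_RECOVERY takes the "skipped" branch.  Whether the real
code can end up on an equivalent path is exactly what the printk's
should tell us.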

Thanks.

-- 
tejun