linux-kernel - Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

Message-ID: <51F667C2.4020801@fastmail.fm>
Date:	Mon, 29 Jul 2013 15:01:54 +0200
From:	Bernd Schubert <bernd.schubert@...tmail.fm>
To:	Nick Alcock <nix@...eri.org.uk>
CC:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-scsi@...r.kernel.org,
	"Martin K. Petersen" <martin.petersen@...cle.com>,
	nick.cheng@...ca.com.tw
Subject: Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup
 / early userspace transition

Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:
> My server's ARC-1210 has been working fine for years, but when I
> upgraded from 3.10.1, it started failing:
>
> Instead of
>
> [    0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> [    0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
>   Driver Version 1.20.00.15 2010/08/05
> [...]
>
> [    4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [    4.115399] sd 7:0:0:1: [sdd] No Caching mode page present
> [    4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [    4.118081]  sdd: sdd1
> [    4.124363] sd 7:0:0:1: [sdd] No Caching mode page present
> [    4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [    4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk
>
> I now see (timestamps and some of the right edge chopped off because not
> captured on my camera, no netconsole as this machine has all my storage
> and is my loghost, and with this bug it can't get at any of that
> storage).
>
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sd 7:0:0:1: [sdd] No Caching mode page present
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
>   sdd: sdd1
> sd 7:0:0:1: [sdd] No Caching mode page present
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sd 7:0:0:1: [sdd] Attached SCSI removable disk
> arcmsr0: abort device command of scsi id = 0 lun = 1
> arcmsr0: abort device command of scsi id = 0 lun = 0
> arcmsr: executing bus reset eh.....num_resets=0, num_[...]
>
> arcmsr0: wait 'abort all outstanding command' timeout
> arcmsr0: executing hw bus reset ....
> arcmsr0: waiting for hw bus reset return, retry=0
> arcmsr0: waiting for hw bus reset return, retry=1
> Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> arcmsr: scsi  bus reset eh returns with success
> [and back to the top of the error messages again, apparently forever,
>   not that the machine would be much use without its RAID array even
>   if this loop terminated at some point, so I only gave it a couple
>   of minutes]
>
> The failure happens precisely at the moment we transition to early
> userspace, so presumably userspace I/O is failing (or something related
> to raw device access, perhaps, since the first thing it does is a
> vgscan).
>
> I haven't bisected yet (sorry, I have work to do which means this
> machine must be running right now), but nothing has changed in the
> arcmsr controller, nor in SCSI-land excepting
>
> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
> Author: Martin K. Petersen <martin.petersen@...cle.com>
> Date:   Thu Jun 6 22:15:55 2013 -0400
>
>      SCSI: sd: Update WRITE SAME heuristics
>
> so my, admittedly largely baseless, suspicions currently fall there.
>
>
> Obviously, at this point, this machine has no modules loaded (it has
> almost none loaded even when fully operational)

I tested this patch with an ARC-1260 and F/W V1.49 and saw no issues. Also,
this patch is only in 3.10.3, not yet in 3.10.1. And I don't think this
commit can cause your issue at all: a failing heuristic would enable
WRITE SAME and would cause issues with linux-md, but nothing should
happen directly in the SCSI layer.
Which was your last working kernel version?
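
To see whether WRITE SAME actually ended up enabled for a given disk, one
option is to read the block queue attribute write_same_max_bytes from sysfs,
where 0 means the block layer will not issue WRITE SAME to that device. The
following is only a minimal sketch: the helper name write_same_limits is made
up for illustration, and it assumes a kernel that exposes this attribute
under /sys/block/<dev>/queue/ (devices without it are skipped).

```python
from pathlib import Path


def write_same_limits(sys_block="/sys/block"):
    """Map each block device name to its write_same_max_bytes value.

    Hypothetical helper for illustration. Devices that do not expose the
    attribute (or whose value cannot be parsed) are skipped. A value of 0
    means WRITE SAME is disabled for that device.
    """
    limits = {}
    for dev in sorted(Path(sys_block).glob("*")):
        attr = dev / "queue" / "write_same_max_bytes"
        try:
            limits[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable: nothing to report.
            continue
    return limits


if __name__ == "__main__":
    for name, limit in write_same_limits().items():
        state = "disabled" if limit == 0 else f"enabled, up to {limit} bytes"
        print(f"{name}: WRITE SAME {state}")
```

A non-zero value for the arcmsr-backed disks on the failing kernel, next to 0
on the working one, would at least be consistent with the heuristics change
having an effect; it would not by itself explain the abort/reset loop.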


Thanks,
Bernd
