linux-kernel - Re: MD/RAID time out writing superblock

Message-ID: <4AAE524C.2030401@rtr.ca>
Date:	Mon, 14 Sep 2009 10:25:16 -0400
From:	Mark Lord <liml@....ca>
To:	Tejun Heo <teheo@...e.de>
Cc:	Chris Webb <chris@...chsys.com>, linux-scsi@...r.kernel.org,
	Ric Wheeler <rwheeler@...hat.com>,
	Andrei Tanas <andrei@...as.ca>, NeilBrown <neilb@...e.de>,
	linux-kernel@...r.kernel.org,
	IDE/ATA development list <linux-ide@...r.kernel.org>,
	Jeff Garzik <jgarzik@...hat.com>, Mark Lord <mlord@...ox.com>
Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>> ..
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that.  Even smart commands sometimes
>>> cause problems.  So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>> ..
>>
>> Bullpucky.  That sort of thing, specifically with IDENTIFY,
>> has never been an issue.
> 
> With SMART it has.  I wouldn't be too surprised if some new firmware
> chokes on repeated IDENTIFY mixed with stream of NCQ commands.  It's
> just not something people (including vendors) do regularly.
..

Yeah, some drives really don't like SMART commands (hddtemp & smartctl).
That's a strange one, too, because the whole idea of SMART
is that it gets used to periodically monitor drive health.

IDENTIFY is much safer -- usually no media access after initial spin-up,
and lots of things exercise it quite regularly.

Pretty much any hdparm command triggers an IDENTIFY beforehand now,
and hddtemp and smartctl both use it too.

I suspect we're missing some info from this specific failure.
Looking back at Chris's earlier posting, the whole thing started
with a FLUSH_CACHE_EXT failure (the "cmd ea" timeout quoted below).
Once that happens, all bets are off on anything that follows.

> Everything will be running fine when suddenly:
> 
>   ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>   ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>           res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
>   ata1.00: status: { DRDY }
>   ata1: hard resetting link
>   ata1: softreset failed (device not ready)
>   ata1: hard resetting link
>   ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>   ata1.00: configured for UDMA/133
>   ata1: EH complete
>   end_request: I/O error, dev sda, sector 1465147272
>   md: super_written gets error=-5, uptodate=0
>   raid10: Disk failure on sda3, disabling device.
>   raid10: Operation continuing on 5 devices.
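
For anyone reading the log above by eye: the first byte after "cmd" is the
ATA command opcode, the first byte after "res" is the status register, and
Emask is a bitmask of libata error classes. Below is a minimal sketch of a
decoder for those three fields. The helper names (decode_cmd, decode_res)
and the small lookup tables are made up for illustration and are nowhere
near complete; the opcode, status-bit and AC_ERR_* values are the ones I
believe the ATA spec and include/linux/libata.h use.

```python
import re

# A few ATA command opcodes (illustrative subset, not exhaustive).
ATA_OPCODES = {
    0xEA: "FLUSH CACHE EXT",
    0xE7: "FLUSH CACHE",
    0xEC: "IDENTIFY DEVICE",
    0xB0: "SMART",
}

# ATA status register bits.
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}

# libata AC_ERR_* bits (subset; labels here are descriptive, not kernel strings).
AC_ERR_BITS = {
    0x01: "device error",
    0x02: "HSM violation",
    0x04: "timeout",
    0x08: "media error",
    0x10: "ATA bus error",
    0x20: "host bus error",
    0x40: "internal error",
}


def decode_bits(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_cmd(line):
    """Name the ATA command in a libata 'cmd xx/...' line, or None if absent."""
    m = re.search(r"cmd ([0-9a-f]{2})/", line)
    if m is None:
        return None
    opcode = int(m.group(1), 16)
    return ATA_OPCODES.get(opcode, "unknown (0x%02x)" % opcode)


def decode_res(line):
    """Return (status bit names, Emask bit names) for a 'res xx/... Emask' line."""
    m = re.search(r"res ([0-9a-f]{2})/.*Emask (0x[0-9a-f]+)", line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    emask = int(m.group(2), 16)
    return decode_bits(status, ATA_STATUS_BITS), decode_bits(emask, AC_ERR_BITS)


if __name__ == "__main__":
    print(decode_cmd("ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"))
    print(decode_res("res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)"))
```

Fed the two lines from Chris's log, this reports FLUSH CACHE EXT for the
command and DRDY plus timeout for the result, which matches the
"status: { DRDY }" and "(timeout)" annotations the kernel already printed.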
