linux-kernel - Re: MD/RAID time out writing superblock

Message-ID: <4AB67637.9060906@gmail.com>
Date:	Sun, 20 Sep 2009 12:36:39 -0600
From:	Robert Hancock <hancockrwd@...il.com>
To:	Mark Lord <liml@....ca>
CC:	Tejun Heo <teheo@...e.de>, Chris Webb <chris@...chsys.com>,
	linux-scsi@...r.kernel.org, Ric Wheeler <rwheeler@...hat.com>,
	Andrei Tanas <andrei@...as.ca>, NeilBrown <neilb@...e.de>,
	linux-kernel@...r.kernel.org,
	IDE/ATA development list <linux-ide@...r.kernel.org>,
	Jeff Garzik <jgarzik@...hat.com>, Mark Lord <mlord@...ox.com>
Subject: Re: MD/RAID time out writing superblock

On 09/17/2009 10:16 AM, Mark Lord wrote:
> Tejun Heo wrote:
>> Hello,
>>
>> Mark Lord wrote:
>>> Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?
>>
>> Nope.
>>
>>> If not, then I think we may need to add code to do it.
>>
>> Hmm... can you explain a bit more? That seems rather extreme to me.
> ..
>
> You may recall that I first raised this issue about a year ago,
> when my own RAID0 array (MythTV box) started showing errors very
> similar to what Chris is reporting.
>
> These were easily triggered by running hddtemp once every few seconds
> to log drive temperatures during Myth recording sessions.
>
> hddtemp uses SMART commands.
>
> The actual errors in the logs were command timeouts,
> but at this point I no longer remember which opcode was
> actually timing out. Disabling the onboard write cache
> immediately "cured" the problem, at the expense of MUCH
> slower I/O times.
>
> My theory at the time was that some non-NCQ commands might be triggering
> an internal FLUSH CACHE within the (Hitachi) drive firmware, which then
> caused the original command to time out in libata (due to the large amount
> of data present in the onboard write cache).
>
> Now that more people are playing the game, we're seeing more and more
> reports of strange interactions with smartd running in the background.

Well, unless the SMART commands are using a non-standard timeout, they
get the same timeout as the flush cache, so an explicit flush cache
would have timed out too.
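
To make that argument concrete, here is a toy model in Python. It is only a
sketch: the function names and all durations and timeouts are hypothetical
values chosen for illustration, not libata constants or measurements from any
drive.

```python
# Toy model of the timeout argument. Every number passed in is an
# illustrative assumption, not a value taken from libata or a real drive.

def times_out(duration_s: float, timeout_s: float) -> bool:
    """A command times out when the drive needs longer than the host allows."""
    return duration_s > timeout_s


def compare(flush_s: float, smart_s: float,
            smart_timeout_s: float, flush_timeout_s: float) -> dict:
    """Compare an implicit flush inside a SMART command with an explicit one."""
    return {
        # Theory from the quoted mail: the SMART command triggers an internal
        # flush, so it takes the flush time plus its own work.
        "smart_with_implicit_flush": times_out(flush_s + smart_s, smart_timeout_s),
        # Proposed workaround, step 1: an explicit FLUSH CACHE issued first.
        "explicit_flush": times_out(flush_s, flush_timeout_s),
        # Proposed workaround, step 2: the SMART command against an empty cache.
        "smart_after_flush": times_out(smart_s, smart_timeout_s),
    }
```

With equal timeouts, e.g. compare(40.0, 0.1, 30.0, 30.0), the explicit flush
times out just as the SMART command with the implicit flush does. Only when
the flush is given a longer timeout, e.g. compare(40.0, 0.1, 30.0, 60.0),
does splitting the two commands avoid the timeout in this model.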

>
> I increasingly suspect that this is an (avoidable) interaction
> between the write-cache and the SMART opcode, and that it could perhaps
> be avoided by doing a FLUSH CACHE before any SMART (or non-data command)
> opcode.
>
> Cheers
