linux-kernel - Re: MD/RAID time out writing superblock

Message-ID: <4AB23B17.2040204@rtr.ca>
Date:	Thu, 17 Sep 2009 09:35:19 -0400
From:	Mark Lord <liml@....ca>
To:	Tejun Heo <tj@...nel.org>
Cc:	Chris Webb <chris@...chsys.com>, Ric Wheeler <rwheeler@...hat.com>,
	Andrei Tanas <andrei@...as.ca>, NeilBrown <neilb@...e.de>,
	linux-kernel@...r.kernel.org,
	IDE/ATA development list <linux-ide@...r.kernel.org>,
	linux-scsi@...r.kernel.org, Jeff Garzik <jgarzik@...hat.com>,
	Mark Lord <mlord@...ox.com>
Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
> Hello,
> 
> Chris Webb wrote:
>> Hi Tejun. Thanks for following up to this. We've done some more
>> experimentation over the last couple of days based on your
>> suggestions and thoughts.
>>
>> Tejun Heo <tj@...nel.org> writes:
>>> Seriously, it's most likely a hardware malfunction although I can't tell
>>> where the problem is with the given data.  Get the hardware fixed.
>> We know this isn't caused by a single faulty piece of hardware,
>> because we have a cluster of identical machines and all have shown
>> this behaviour. This doesn't mean that there isn't a hardware
>> problem, but if there is one, it's a design problem or firmware bug
>> affecting all of our hosts.
> 
> If it's multiple machines, it's much less likely to be faulty drives,
> but if the machines are configured mostly identically, hardware
> problems can't be ruled out either.
> 
>> There have also been a few reports of problems which look very
>> similar in this thread from people with somewhat different hardware
>> and drives to ours.
> 
> I wouldn't connect the reported cases too eagerly at this point.  Too
> many different causes end up showing similar symptoms especially with
> timeouts.
> 
>>> The above are IDENTIFY.  Who's issuing IDENTIFY regularly?  It isn't
>>> from the regular IO paths or md.  It's probably being issued via SG_IO
>>> from userland.  These failures don't affect normal operation.
>> [...]
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that.  Even smart commands sometimes
>>> cause problems.  So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>> We tracked this down to some (excessively frequent!) monitoring we
>> were doing using smartctl. Things were improved considerably by
>> stopping smartd and disabling all callers of smartctl, although it
>> doesn't appear to have been a cure. The frequency of these timeouts
>> during resync seems to have gone from about once every two hours to
>> about once a day, which means we've been able to complete some
>> resyncs whereas we were unable to before.
> 
> That's interesting.  One important side effect of issuing IDENTIFY
> commands is that they serialize the command stream, as they are not
> NCQ commands, and thus could change command patterns significantly.
..

SMART is the opcode that is most frequently implicated here, not IDENTIFY.
Note that even a barrier FLUSH CACHE is non-NCQ and will serialize the stream.
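
To illustrate the point, here is a rough sketch in Python (not kernel
code) of how non-NCQ commands break up a queued stream.  The opcode
values are the standard ATA ones; the helper names is_ncq() and
count_queue_drains() are made up for this example, and the model is
deliberately simplified: it assumes no queued command completes on its
own before the next non-NCQ command arrives.

```python
# Standard ATA command opcodes.
READ_FPDMA_QUEUED = 0x60   # NCQ read
WRITE_FPDMA_QUEUED = 0x61  # NCQ write
SMART = 0xB0
FLUSH_CACHE = 0xE7
FLUSH_CACHE_EXT = 0xEA
IDENTIFY_DEVICE = 0xEC

# Only the queued read/write opcodes are treated as NCQ in this sketch.
NCQ_OPCODES = {READ_FPDMA_QUEUED, WRITE_FPDMA_QUEUED}


def is_ncq(opcode):
    """Return True if the opcode is one of the NCQ read/write commands."""
    return opcode in NCQ_OPCODES


def count_queue_drains(stream):
    """Count how many times the queue has to drain for a non-NCQ command.

    Toy model (an assumption of this example): queued commands stay in
    flight until a non-NCQ command arrives, at which point everything
    outstanding must finish before that command can be issued.
    """
    drains = 0
    in_flight = 0
    for opcode in stream:
        if is_ncq(opcode):
            in_flight += 1
        else:
            if in_flight:
                drains += 1
            in_flight = 0
    return drains


stream = [
    WRITE_FPDMA_QUEUED, WRITE_FPDMA_QUEUED, SMART,
    WRITE_FPDMA_QUEUED, FLUSH_CACHE_EXT, IDENTIFY_DEVICE,
]
print(count_queue_drains(stream))
```

In this model SMART, FLUSH CACHE and IDENTIFY all behave the same way:
each one forces outstanding queued commands to finish first.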

Cheers
