Message-ID: <4AB23B17.2040204@rtr.ca>
Date: Thu, 17 Sep 2009 09:35:19 -0400
From: Mark Lord <liml@....ca>
To: Tejun Heo <tj@...nel.org>
Cc: Chris Webb <chris@...chsys.com>, Ric Wheeler <rwheeler@...hat.com>,
Andrei Tanas <andrei@...as.ca>, NeilBrown <neilb@...e.de>,
linux-kernel@...r.kernel.org,
IDE/ATA development list <linux-ide@...r.kernel.org>,
linux-scsi@...r.kernel.org, Jeff Garzik <jgarzik@...hat.com>,
Mark Lord <mlord@...ox.com>
Subject: Re: MD/RAID time out writing superblock
Tejun Heo wrote:
> Hello,
>
> Chris Webb wrote:
>> Hi Tejun. Thanks for following up to this. We've done some more
>> experimentation over the last couple of days based on your
>> suggestions and thoughts.
>>
>> Tejun Heo <tj@...nel.org> writes:
>>> Seriously, it's most likely a hardware malfunction although I can't tell
>>> where the problem is with the given data. Get the hardware fixed.
>> We know this isn't caused by a single faulty piece of hardware,
>> because we have a cluster of identical machines and all have shown
>> this behaviour. This doesn't mean that there isn't a hardware
>> problem, but if there is one, it's a design problem or firmware bug
>> affecting all of our hosts.
>
> If it's multiple machines, it's much less likely to be faulty drives,
> but if the machines are configured mostly identically, hardware
> problems can't be ruled out either.
>
>> There have also been a few reports of problems which look very
>> similar in this thread from people with somewhat different hardware
>> and drives to ours.
>
> I wouldn't connect the reported cases too eagerly at this point. Too
> many different causes end up showing similar symptoms especially with
> timeouts.
>
>>> The aboves are IDENTIFY. Who's issuing IDENTIFY regularly? It isn't
>>> from the regular IO paths or md. It's probably being issued via SG_IO
>>> from userland. These failures don't affect normal operation.
>> [...]
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that. Even smart commands sometimes
>>> cause problems. So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>> We tracked this down to some (excessively frequent!) monitoring we
>> were doing using smartctl. Things were improved considerably by
>> stopping smartd and disabling all callers of smartctl, although it
>> doesn't appear to have been a cure. The frequency of these timeouts
>> during resync seems to have gone from about once every two hours to
>> about once a day, which means we've been able to complete some
>> resyncs whereas we were unable to before.
>
> That's interesting. One important side effect of issuing IDENTIFY is
> that they will serialize command streams as they are not NCQ commands
> and thus could change command patterns significantly.
..
SMART is the opcode that is most frequently implicated here, not IDENTIFY.
Note that even a barrier FLUSH CACHE is a non-NCQ command and will serialize the stream.
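
A rough way to picture the serialization point is the toy model below. It is only a sketch under stated assumptions, not how libata actually schedules commands: the opcode values are the standard ATA ones, but the function name `count_queue_drains` and the "every non-NCQ command drains the queue" rule are illustrative simplifications, and queued commands are assumed never to complete on their own.

```python
# Standard ATA command opcodes. Anything not listed as NCQ is treated
# as non-NCQ in this toy model (an assumption made for illustration).
NCQ_OPCODES = {
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
}
NON_NCQ_OPCODES = {
    0xEC: "IDENTIFY DEVICE",
    0xB0: "SMART",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}


def count_queue_drains(stream):
    """Count how often a non-NCQ command arrives while NCQ commands are
    outstanding, forcing the queue to drain before it can be issued.

    Hypothetical helper: queued commands are assumed to stay outstanding
    until a non-NCQ command forces them to finish.
    """
    drains = 0
    outstanding = 0
    for opcode in stream:
        if opcode in NCQ_OPCODES:
            outstanding += 1
        else:
            # A non-NCQ command cannot be mixed with queued commands,
            # so everything outstanding has to finish first.
            if outstanding:
                drains += 1
            outstanding = 0
    return drains


if __name__ == "__main__":
    # Two queued writes, a SMART poll, one more write, then a barrier flush.
    stream = [0x61, 0x61, 0xB0, 0x61, 0xE7]
    print(count_queue_drains(stream))
```

In this model a SMART poll and a barrier flush cost the same thing, one forced drain each, which is why frequent smartctl polling changes the command pattern in the same way IDENTIFY does.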
Cheers