Message-ID: <4A95349E.7010101@redhat.com>
Date: Wed, 26 Aug 2009 09:11:58 -0400
From: Ric Wheeler <rwheeler@...hat.com>
To: Theodore Tso <tytso@....edu>, Pavel Machek <pavel@....cz>,
david@...g.hm, Florian Weimer <fweimer@....de>,
Goswin von Brederlow <goswin-v-b@....de>,
Rob Landley <rob@...dley.net>,
kernel list <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...l.org>, mtk.manpages@...il.com,
rdunlap@...otime.net, linux-doc@...r.kernel.org,
linux-ext4@...r.kernel.org, corbet@....net
Subject: Re: [patch] document flash/RAID dangers
On 08/26/2009 08:40 AM, Theodore Tso wrote:
> On Wed, Aug 26, 2009 at 07:58:40AM -0400, Ric Wheeler wrote:
>>> Drive in raid 5 failed; hot spare was available (no idea about
>>> UPS). System apparently locked up trying to talk to the failed drive,
>>> or maybe admin just was not patient enough, so he just powercycled the
>>> array. He lost the array.
>>>
>>> So while most people will not aggressively powercycle the RAID array,
>>> drive failure still provokes little-tested error paths, and getting
>>> an unclean shutdown is quite easy in such a case.
>>
>> Then what we need to document is do not power cycle an array during a
>> rebuild, right?
>
> Well, the software raid layer could be improved so that it implements
> scrubbing by default (i.e., have the md package install a cron job to
> implement a periodic scrub pass automatically). The MD code could
> also regularly check to make sure the hot spare is OK; the other
> possibility is that the hot spare, which hadn't been used in a long
> time, had silently failed.
Actually, MD can already do this scan. It does not run automatically, but you
can set up a simple cron job to kick off a periodic "check". Getting the
frequency of the scrubbing right is a delicate balance.
On one hand, you want to detect errors in a timely fashion, and certainly to
catch single-sector errors before a second sector-level error develops on
another drive.
On the other hand, running scans/scrubs continually impacts the performance of
your real workload and can potentially impact your components' life span by
subjecting them to a heavy workload.
In my experience, the rule of thumb is that most people settle on a scan once
every week or two (done at a throttled rate).
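For anyone who wants to try this, here is a rough sketch of what such a cron
job could run. It assumes the md sysfs files sync_action and sync_speed_max
under /sys/block/<array>/md/; the function name start_md_check, the array name
md0 and the throttle value are made up for illustration, so adjust to taste.

```python
from pathlib import Path


def start_md_check(md_name, speed_max_kib=None, sysfs_root="/sys"):
    """Ask MD to start a "check" pass on one array (e.g. "md0").

    Returns True if the check was requested, False if the array was busy.
    sysfs_root is a parameter only so the sketch can be tried against a
    scratch directory instead of the real /sys.
    """
    md_dir = Path(sysfs_root) / "block" / md_name / "md"
    action = md_dir / "sync_action"

    # Leave the array alone if a resync, recovery or another check is
    # already running; only start from the idle state.
    if action.read_text().strip() != "idle":
        return False

    # Optional per-array throttle so the scrub does not starve the real
    # workload. The value here is an illustrative assumption.
    if speed_max_kib is not None:
        (md_dir / "sync_speed_max").write_text(f"{speed_max_kib}\n")

    # Writing "check" requests the scrub pass.
    action.write_text("check\n")
    return True


# Hypothetical use from a weekly cron job (needs root):
#     start_md_check("md0", speed_max_kib=50000)
```

Run from cron once a week or once every two weeks, this matches the rule of
thumb above; raising or lowering the throttle is how you trade scan duration
against the impact on your real workload.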
>
>> In the end, there are cascading failures that will defeat any data
>> protection scheme, but that does not mean that the value of that scheme
>> is zero. We need to get more people to use RAID (including MD5) and
>> try to enhance it as we go. Just using a single disk is not a good
>> thing...
>
> Yep; the solution is to improve the storage devices. It is *not* to
> encourage people to think RAID is not worth it, or that somehow ext2
> is better than ext3 because it runs fsck's all the time at boot up.
> That's just crazy talk.
>
> - Ted
Agreed....
ric