linux-kernel - Re: [patch] document flash/RAID dangers

Message-ID: <20090826112121.GD26595@elf.ucw.cz>
Date:	Wed, 26 Aug 2009 13:21:21 +0200
From:	Pavel Machek <pavel@....cz>
To:	Ric Wheeler <rwheeler@...hat.com>
Cc:	david@...g.hm, Theodore Tso <tytso@....edu>,
	Florian Weimer <fweimer@....de>,
	Goswin von Brederlow <goswin-v-b@....de>,
	Rob Landley <rob@...dley.net>,
	kernel list <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...l.org>, mtk.manpages@...il.com,
	rdunlap@...otime.net, linux-doc@...r.kernel.org,
	linux-ext4@...r.kernel.org, corbet@....net
Subject: Re: [patch] document flash/RAID dangers

On Tue 2009-08-25 20:45:26, Ric Wheeler wrote:
> On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>>> errors, my machine actually has a button that produces exactly that
>>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>>> slightly unusual configuration, but I suspect people should just do
>>>>>> that for testing. (And from the discussion, people seem to think that
>>>>>> degraded raid5 is equivalent to raid0).
>>>>>
>>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>>
>>>> Look, I don't need full drive failure for this to happen. I can just
>>>> remove one disk from array. I don't need power failure, I can just
>>>> press the power button. I don't even need to rebuild anything, I can
>>>> just write to degraded array.
>>>>
>>>> Given that all events are under my control, statistics make little
>>>> sense here.
>>>
>>> You are deliberately causing a double failure - pressing the power button
>>> after pulling a drive is exactly that scenario.
>>
>> Exactly. And now I'm trying to get that documented, so that people
>> don't do it and still expect their fs to be consistent.
>
> The problem I have is that the way you word it steers people away from 
> RAID5 and better data integrity. Your intentions are good, but your text 
> is going to do considerable harm.
>
> Most people don't intentionally drop power (or have a power failure) 
> during RAID rebuilds....

An example I have seen went like this:

A drive in a raid 5 array failed; a hot spare was available (no idea
about a UPS). The system apparently locked up trying to talk to the
failed drive, or maybe the admin just was not patient enough, so he
power-cycled the array. He lost the array.

So while most people will not aggressively power-cycle a RAID array,
a drive failure still exercises little-tested error paths, and getting
an unclean shutdown is quite easy in such a case.
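
To make the failure mode concrete, here is a minimal sketch (not part of
the original discussion) of why an unclean shutdown on a degraded RAID5
array can destroy data that was not even being written. It assumes a
hypothetical three-disk stripe with plain XOR parity; the block contents
and the names xor_blocks, d0, d1 and parity are illustrative only and do
not reflect how md actually lays out or updates stripes.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: two data blocks plus one parity block.
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# The disk holding d1 is removed. The array is now degraded: d1 exists
# only implicitly, as d0 XOR parity.
assert xor_blocks(d0, parity) == d1

# Now d0 is overwritten. Both d0 and parity have to change on disk.
new_d0 = b"CCCC"
new_parity = xor_blocks(new_d0, d1)

# If both writes complete, d1 is still recoverable.
assert xor_blocks(new_d0, new_parity) == d1

# Power is cut after new_d0 reached the disk but before the parity did
# (an assumed ordering; the reverse ordering fails the same way).
disk_d0, disk_parity = new_d0, parity

# After reboot, reconstructing the missing block yields garbage, even
# though d1 itself was never written to.
reconstructed_d1 = xor_blocks(disk_d0, disk_parity)
d1_survived = reconstructed_d1 == d1
```

With all three disks present the same torn write would leave d1 intact
on its own disk and only the parity stale; in the degraded case nothing
else holds d1, so the stale parity silently corrupts it.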
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html