Message-ID: <4A9FCF6B.1080704@redhat.com>
Date: Thu, 03 Sep 2009 10:15:07 -0400
From: Ric Wheeler <rwheeler@...hat.com>
To: Krzysztof Halasa <khc@...waw.pl>
CC: Christoph Hellwig <hch@...radead.org>, Mark Lord <lkml@....ca>,
Michael Tokarev <mjt@....msk.ru>, david@...g.hm,
Pavel Machek <pavel@....cz>, Theodore Tso <tytso@....edu>,
NeilBrown <neilb@...e.de>, Rob Landley <rob@...dley.net>,
Florian Weimer <fweimer@....de>,
Goswin von Brederlow <goswin-v-b@....de>,
kernel list <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...l.org>, mtk.manpages@...il.com,
rdunlap@...otime.net, linux-doc@...r.kernel.org,
linux-ext4@...r.kernel.org, corbet@....net
Subject: wishful thinking about atomic, multi-sector or full MD stripe width,
writes in storage
On 09/03/2009 09:59 AM, Krzysztof Halasa wrote:
> Ric Wheeler<rwheeler@...hat.com> writes:
>
>> We (red hat) have all kinds of different raid boxes...
>
> I have no doubt about it, but are those you know equipped with
> battery-backed write-back cache? Are they using SATA disks?
>
> We can _at_best_ compare non-battery-backed RAID using SATA disks with
> what we typically have in a PC.
The whole thread above is about software MD using commodity drives (S-ATA or
SAS) without battery-backed write cache.
We have that setup (and I have it personally) and do test it.
You must disable the write cache on these commodity drives *if* the MD RAID
level does not support barriers properly.
This will greatly reduce errors after a power loss (both in degraded state and
non-degraded state), but it will not eliminate data loss entirely. You simply
cannot do that with any storage device!
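
For anyone who wants to try this, here is a minimal sketch of how the write
cache is commonly turned off on an ATA/S-ATA drive with hdparm. It assumes
hdparm is installed and that the drive honours the setting; /dev/sdX is a
placeholder, and the helper name is made up for illustration. The function only
builds the command and does not run it (SAS drives generally need a different
tool, such as sdparm).

```python
import shlex

def disable_write_cache_command(device):
    """Build (but do not run) the hdparm command that turns off a drive's write cache."""
    if not device.startswith("/dev/"):
        raise ValueError("expected a block device path such as /dev/sdX")
    # hdparm -W 0 <device> disables the drive's volatile write cache.
    return ["hdparm", "-W", "0", device]

# /dev/sdX is a placeholder, not a real device name.
print(shlex.join(disable_write_cache_command("/dev/sdX")))
```

Running the resulting command needs root, and some drives re-enable the cache
after a power cycle, so the setting may have to be reapplied at boot.
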
Note that even without MD RAID, the file system issues I/Os in units of the file
system block size (normally 4096 bytes), while most commodity storage devices
use a 512-byte sector size. Each block write therefore has to update eight
512-byte sectors.
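
To make the arithmetic concrete, here is a small sketch that maps one file
system block to the logical sectors it covers. The 4096 and 512 byte sizes are
the typical values mentioned above; the function name is made up for
illustration.

```python
SECTOR_SIZE = 512   # bytes, typical commodity drive sector
BLOCK_SIZE = 4096   # bytes, typical file system block

def sectors_for_block(block_number, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return the logical sector numbers that one file system block covers."""
    if block_size % sector_size != 0:
        raise ValueError("block size must be a multiple of the sector size")
    per_block = block_size // sector_size
    first = block_number * per_block
    return list(range(first, first + per_block))

# Every block write touches this many sectors, any of which may fail to land.
print(len(sectors_for_block(0)))
print(sectors_for_block(3))
```
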
Drives can (and do) have multiple platters and surfaces and it is perfectly
normal to have contiguous logical ranges of sectors map to non-contiguous
sectors physically. Imagine a 4KB write stripe that straddles two adjacent
tracks on one platter (requiring a seek) or mapped across two surfaces
(requiring a head switch). Also, a remapped sector can require more or less a
full surface seek from wherever you are to the remapped sector area of the drive.
These are all cases in which a power loss can leave that 4KB range of sectors
only partially updated, even on a local (non-MD) device. Note that, unlike
RAID/MD, local storage has no parity on the server to detect this partial
write.
This is why new file systems like btrfs and zfs do checksumming of data and
metadata. This won't prevent partial updates during a write, but can at least
detect them and try to do some kind of recovery.
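
A rough sketch of the idea, assuming a per-block checksum stored separately from
the data: a write that is cut off after only some of its eight sectors reach the
media leaves a block whose checksum no longer matches. zlib.crc32 is used here
only as a stand-in; the real file systems use their own checksum algorithms and
on-disk layouts, and the helper names are hypothetical.

```python
import zlib

SECTOR_SIZE = 512
BLOCK_SIZE = 4096

def torn_write(old_block, new_block, sectors_written):
    """Simulate power loss after only the first `sectors_written` sectors landed."""
    cut = sectors_written * SECTOR_SIZE
    return new_block[:cut] + old_block[cut:]

def block_is_intact(block, expected_crc):
    """Compare the block's checksum with the one recorded for the intended write."""
    return zlib.crc32(block) == expected_crc

old = bytes([0xAA]) * BLOCK_SIZE      # contents before the write
new = bytes([0x55]) * BLOCK_SIZE      # contents the file system meant to write
new_crc = zlib.crc32(new)             # checksum recorded for the new contents

on_disk = torn_write(old, new, sectors_written=3)
print(block_is_intact(on_disk, new_crc))   # partial update is detected
print(block_is_intact(new, new_crc))       # a complete write verifies
```

Detection is all this gives you; recovering the data still needs a good copy
somewhere else (a mirror, parity, or an older version of the block).
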
In other words, this is not just an MD issue, it is entirely possible even with
non-MD devices.
Also, when you enable the write cache (MD or not) you are buffering multiple
MB of data that can go away on power loss. That is a far greater exposure (10x)
than the one the partial RAID rewrite case worries about.
ric