[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4A932B18.1020209@redhat.com>
Date: Mon, 24 Aug 2009 20:06:48 -0400
From: Ric Wheeler <rwheeler@...hat.com>
To: Pavel Machek <pavel@....cz>
CC: Theodore Tso <tytso@....edu>, Florian Weimer <fweimer@....de>,
Goswin von Brederlow <goswin-v-b@....de>,
Rob Landley <rob@...dley.net>,
kernel list <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...l.org>, mtk.manpages@...il.com,
rdunlap@...otime.net, linux-doc@...r.kernel.org,
linux-ext4@...r.kernel.org, corbet@....net
Subject: Re: [patch] ext2/3: document conditions when reliable operation is
possible
Pavel Machek wrote:
> On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
>
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>
>>>> I have to admit that I have not paid enough attention to this specifics
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>>>> IO's?
>>>>
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>>
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector?
>>
>
> First... I consider myself quite competent in the os level, yet I did
> not realize what flash does and what that means for data
> integrity. That means we need some documentation, or maybe we should
> refuse to mount those devices r/w or something.
>
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)
>
> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.
>
>
>>>> Your statement is overly broad - ext3 on a commercial RAID array that
>>>> does RAID5 or RAID6, etc has no issues that I know of.
>>>>
>>> If your commercial RAID array is battery backed, maybe. But I was
>>> talking Linux MD here.
>>>
> ...
>
>> If your concern is that with Linux MD, you could potentially lose an
>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>> again, this isn't a filesystem specific cliam; it's true for all
>> filesystems. I don't know of any file system that can survive having
>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>> failure.
>>
>
> Again, ext2 handles that in a way user expects it.
>
> At least I was teached "ext2 needs fsck after powerfail; ext3 can
> handle powerfails just ok".
>
>
So, would you be happy if ext3 fsck was always run on reboot (at least
for flash devices)?
ric
>> I'll note, BTW, that AIX uses a journal to protect against these sorts
>> of problems with software raid; this also means that with AIX, you
>> also don't have to rebuild a RAID 1 device after an unclean shutdown,
>> like you have do with Linux MD. This was on the EVMS's team
>> development list to implement for Linux, but it got canned after LVM
>> won out, lo those many years ago. Ce la vie; but it's a problem which
>> is solvable at the RAID layer, and which is traditionally and
>> historically solved in competent RAID implementations.
>>
>
> Yep, we should add journal to RAID; or at least write "Linux MD
> *needs* an UPS" in big and bold letters. I'm trying to do the second
> part.
>
> (Attached is current version of the patch).
>
> [If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are
> generaly unsafe to use without UPS/reliable connection/no kernel
> bugs... then I may try to push that. I was not sure... maybe some
> filesystem _can_ handle this kind of issues?]
>
> Pavel
>
> (*) Ok, now... user expects to run fsck, but very advanced users may
> not expect old data to be damaged. Certainly I was not advanced enough
> user few months ago.
>
> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..d1ef4d0
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt
> @@ -0,0 +1,57 @@
> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so. Not all filesystems require all of these
> +to be satisfied for safe operation.
> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly.
> +
> + Fortunately writes failing are very uncommon on traditional
> + spinning disks, as they have spare sectors they use when write
> + fails.
> +
> +Don't cause collateral damage on a failed write (NO-COLLATERALS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +On some storage systems, failed write (for example due to power
> +failure) kills data in adjacent (or maybe unrelated) sectors.
> +
> +Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
> +and are thus unsuitable for all filesystems I know.
> +
> + An inherent problem with using flash as a normal block device
> + is that the flash erase size is bigger than most filesystem
> + sector sizes. So when you request a write, it may erase and
> + rewrite some 64k, 128k, or even a couple megabytes on the
> + really _big_ ones.
> +
> + If you lose power in the middle of that, filesystem won't
> + notice that data in the "sectors" _around_ the one your were
> + trying to write to got trashed.
> +
> + MD RAID-4/5/6 in degraded mode has similar problem, stripes
> + behave similary to eraseblocks.
> +
> +
> +Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> + Because RAM tends to fail faster than rest of system during
> + powerfail, special hw killing DMA transfers may be necessary;
> + otherwise, disks may write garbage during powerfail.
> + This may be quite common on generic PC machines.
> +
> + Note that atomic write is very hard to guarantee for MD RAID-4/5/6,
> + because it needs to write both changed data, and parity, to
> + different disks. (But it will only really show up in degraded mode).
> + UPS for RAID array should help.
> +
> +
> +
> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 67639f9..ef9ff0f 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
> have to be 8 character filenames, even then we are fairly close to
> running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext2 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> + (NO-COLLATERALS)
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> + as of 2.6.28. hdparm -W0 disables it on SATA disks.
> +
> Journaling
> -----------
> -
> -A journaling extension to the ext2 code has been developed by Stephen
> -Tweedie. It avoids the risks of metadata corruption and the need to
> -wait for e2fsck to complete after a crash, without requiring a change
> -to the on-disk ext2 layout. In a nutshell, the journal is a regular
> -file which stores whole metadata (and optionally data) blocks that have
> -been modified, prior to writing them into the filesystem. This means
> -it is possible to add a journal to an existing ext2 filesystem without
> -the need for data conversion.
> -
> -When changes to the filesystem (e.g. a file is renamed) they are stored in
> -a transaction in the journal and can either be complete or incomplete at
> -the time of a crash. If a transaction is complete at the time of a crash
> -(or in the normal case where the system does not crash), then any blocks
> -in that transaction are guaranteed to represent a valid filesystem state,
> -and are copied into the filesystem. If a transaction is incomplete at
> -the time of the crash, then there is no guarantee of consistency for
> -the blocks in that transaction so they are discarded (which means any
> -filesystem changes they represent are also lost).
> +==========
> Check Documentation/filesystems/ext3.txt if you want to read more about
> ext3 and journaling.
>
> diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
> index 570f9bd..752f4b4 100644
> --- a/Documentation/filesystems/ext3.txt
> +++ b/Documentation/filesystems/ext3.txt
> @@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger.
> ext2online: online (mounted) ext2 and ext3 filesystem resizer
>
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:
> +
> +* write errors not allowed (NO-WRITE-ERRORS)
> +
> +* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
> +
> + Ext3 handles trash getting written into sectors during powerfail
> + surprisingly well. It's not foolproof, but it is resilient.
> + Incomplete journal entries are ignored, and journal replay of
> + complete entries will often "repair" garbage written into the inode
> + table. The data=journal option extends this behavior to file and
> + directory data blocks as well.
> +
> +
> +and obviously:
> +
> +* don't cause collateral damage to adjacent sectors on a failed write
> + (NO-COLLATERALS)
> +
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default, use "barrier=1"
> + mount option after making sure hw can support them).
> +
> + hdparm -I reports disk features. If you have "Native
> + Command Queueing" is the feature you are looking for.
> +
> +
> References
> ==========
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists