Message-ID: <20081202163720.GB18162@mit.edu>
Date: Tue, 2 Dec 2008 11:37:20 -0500
From: Theodore Tso <tytso@....edu>
To: Pavel Machek <pavel@...e.cz>
Cc: mikulas@...ax.karlin.mff.cuni.cz, clock@...ey.karlin.mff.cuni.cz,
kernel list <linux-kernel@...r.kernel.org>, aviro@...hat.com
Subject: Re: writing file to disk: not as easy as it looks
On Tue, Dec 02, 2008 at 04:26:18PM +0100, Pavel Machek wrote:
> > I can understand why you might want to fsync the containing directory
> > to make sure the directory entry got written to disk --- but if you're
> > that paranoid, many modern filesystems use some kind of tree
> > structure
>
> If I'm trying to write foo/bar/baz/file, and file/baz inodes/dentries
> are written to disk, but foo is not, file still will not be found
> under full name - and recovering it from lost&found is hard to do
> automatically.
Only if you've freshly created the foo/bar/baz directories... If you
have, then yes, you'll need to sync each one. Normally the paranoid
programs do this after each mkdir call, though.
For ext3/ext4, because of the entangled commit factor, fsync()'ing
the file is sufficient, but that's not something you can count on
across filesystems.
> If disk loses data after acknowledging the write, all hope is lost.
> Else I expect filesystem to preserve data I successfully synced.
>
> (In the b-tree split failed case I'd expect transaction commit to
> fail because new data could not be written; at that point
> disk+journal should still contain all the data needed for
> recovery of synced/old files, right?)
Not necessarily. For filesystems that do logical journalling (e.g.,
xfs and jfs), the only thing written in the journal is the
logical change (i.e., "new dir entry 'file_that_causes_the_node_split'").
The transaction commits *first*, and then the filesystem tries to
update the on-disk structures with the change, and it's only then that
the write fails. Data can very easily get lost.
Even for ext3/ext4, which do physical journalling, it's still the
case that the journal commits first, and it's only later that the
change is written to its final location. If the disk fails some of
the writes, it's possible to lose data, especially if the two blocks
involved in the node split are far apart, and the write to the
existing old btree block fails.
> > What exactly are your requirements here, and what are you trying to
> > do? What are you worried about? Most MTA's are quite happy
> > settling
>
> I'm trying to put my main filesystem on a SD card. hp2133 has only 4GB
> internal flash, so I got 32GB SDHC. Unfortunately, SD card on hp is
> very easy to eject by mistake.
So what you really want is some way of constantly flushing data to the
disk, probably after every single mkdir, every single close operation.
Of course, the tradeoff is that your flash card will get a lot of
extra wear. I hate to say this, but have you considered something
like tape or velcro to secure the SD card?
- Ted