[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081202205558.GD20858@mit.edu>
Date: Tue, 2 Dec 2008 15:55:58 -0500
From: Theodore Tso <tytso@....edu>
To: Chris Friesen <cfriesen@...tel.com>
Cc: Pavel Machek <pavel@...e.cz>, mikulas@...ax.karlin.mff.cuni.cz,
clock@...ey.karlin.mff.cuni.cz,
kernel list <linux-kernel@...r.kernel.org>, aviro@...hat.com
Subject: Re: writing file to disk: not as easy as it looks
On Tue, Dec 02, 2008 at 11:22:58AM -0600, Chris Friesen wrote:
> Theodore Tso wrote:
>
>> Even for ext3/ext4 which is doing physical journalling, it's still the
>> case that the journal commits first, and it's only later when the
>> write happens that we write out the change. If the disk fails some of
>> the writes, it's possible to lose data, especially if the two blocks
>> involved in the node split are far apart, and the write to the
>> existing old btree block fails.
>
> Yikes. I was under the impression that once the journal hit the platter
> then the data were safe (barring media corruption).
Well, this is a case of media corruption (or a cosmic ray hitting
hitting a ribbon cable in the disk controller sending the write to the
wrong location on disk, or someone bumping the server causing the disk
head to lift up a little higher than normal while it was writing the
disk sector, etc.). But it is a case of the hard drive misbehaving.
Heck, if you have a hiccup while writing an inode table block out to
disk (for example a power failure at just the wrong time), so the
memory (which is more voltage sensitive than hard drives) DMA's
garbage which gets written to the inode table, you could lose a large
number of adjacent inodes when garbage gets splatted over the inode
table.
Ext3 tends to recover from this better than other filesystems, thanks
to the fact that it does physical block journalling, but you do pay
for this in terms of performance if you have a metadata-intensive
workload, because you're writing more bytes to the journal for each
metadata opeation.
> It seems like the more I learn about filesystems, the more failure modes
> there are and the fewer guarantees can be made. It's amazing that
> things work as well as they do...
There are certainly things you can do. Put your fileservers's on
UPS's. Use RAID. Make backups. Do all three. :-)
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists