Message-ID: <20110628093652.GA29978@quack.suse.cz>
Date: Tue, 28 Jun 2011 11:36:52 +0200
From: Jan Kara <jack@...e.cz>
To: "Moffett, Kyle D" <Kyle.D.Moffett@...ing.com>
Cc: Ted Ts'o <tytso@....edu>, Lukas Czerner <lczerner@...hat.com>,
Jan Kara <jack@...e.cz>, Sean Ryle <seanbo@...il.com>,
"615998@...s.debian.org" <615998@...s.debian.org>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
Sachin Sant <sachinp@...ibm.com>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
Subject: Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel
BUG at fs/jbd2/commit.c:534" from Postfix on ext4
On Mon 27-06-11 23:21:17, Moffett, Kyle D wrote:
> On Jun 27, 2011, at 12:01, Ted Ts'o wrote:
> > On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
> >>> I've found some. So although data=journal users are a minority, there are
> >>> some. That being said, I agree with you that we should do something about
> >>> it - either state that we want to fully support data=journal - and then we
> >>> should really do better with testing it - or deprecate it and remove it
> >>> (which would save us some complications in the code).
> >>>
> >>> I would be slightly in favor of removing it (code simplicity, fewer options
> >>> to configure for the admin, fewer options to test for us; some users I've
> >>> come across actually were not quite sure why they were using it - they just
> >>> thought it looked safer).
> >
> > Hmm... FYI, I hope to be able to bring online automated testing for
> > ext4 later this summer (there's a testing person at Google who has
> > signed up to work on setting this up as his 20% project). The test
> > matrix that I gave him includes data=journal, so we will be getting
> > better testing in the near future.
> >
> > At least historically, data=journalling was the *simpler* case, and
> > was the first thing supported by ext4. (data=ordered required revoke
> > handling which didn't land for six months or so.) So I'm not really
> > convinced that removing it buys us that much code simplification.
> >
> > That being said, it is true that data=journalled isn't necessarily
> > faster. For heavy disk-bound workloads, it can be slower. So I can
> > imagine adding some documentation that warns people not to use
> > data=journal unless they really know what they are doing, but at least
> > personally, I'm a bit reluctant to dispense with a bug report like
> > this by saying, "oh, that feature should be deprecated".
>
> I suppose I should chime in here, since I'm the one who (potentially
> incorrectly) thinks I should be using data=journalled mode.
>
> My basic impression is that the use of "data=journalled" can help
> reduce the risk (slightly) of serious corruption to some kinds of
> databases when the application does not provide appropriate syncs
> or journalling on its own (e.g. text-based Wiki database files).
It depends on the way such programs update the database files. But
generally yes, data=journal provides somewhat stronger guarantees than the
other journaling modes - see below.
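The usual way for an application not to depend on the data= mode at all is
the write-temp-file, fsync(), rename() pattern. A minimal sketch in C (the
file names are made up, and the fsync of the containing directory is left
out for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace 'path' with 'len' bytes of 'buf' so that after a crash we see
 * either the complete old file or the complete new one. */
static int safe_replace(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int fd;

	snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd)) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);
	return rename(tmp, path);	/* data is on disk before the new name is */
}

int main(void)
{
	const char *page = "== example wiki page ==\n";

	return safe_replace("page.db", page, strlen(page)) ? 1 : 0;
}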
> Please correct me if this is horribly horribly wrong:
>
> no journal:
> Nothing is journalled
> + Very fast.
> + Works well for filesystems that are "mkfs"ed on every boot
> - Have to fsck after every reboot
Fsck is needed only after a crash / hard powerdown. Otherwise completely
correct. Plus you always have the possibility of exposing uninitialized
(potentially sensitive) data after an fsck.
Actually, a normal desktop might be quite happy with a non-journaled
filesystem when fsck is fast enough.
> data=writeback:
> Metadata is journalled, data (to allocated extents) may be written
> before or after the metadata is updated with a new file size.
> + Fast (not as fast as unjournalled)
> + No need to "fsck" after a hard power-down
> - A crash or power failure in the middle of a write could leave
> old data on disk at the end of a file. If security labeling
> such as SELinux is enabled, this could "contaminate" a file with
> data from a deleted file that was at a higher sensitivity.
> Log files (including binary database replication logs) may be
> effectively corrupted as a result.
Correct.
> data=ordered:
> Data appended to a file will be written before the metadata
> extending the length of the file is written, and in certain cases
> the data will be written before file renames (partial ordering),
> but the data itself is unjournalled, and may be only partially
> complete for updates.
> + Does not write data to the media twice
> + A crash or power failure will not leave old uninitialized data
> in files.
> - Data writes to files may only partially complete in the event
> of a crash. No problems for logfiles, or self-journalled
> application databases, but others may experience partial writes
> in the event of a crash and need recovery.
Correct. One should also note that no one guarantees the order in which
data hits the disk - i.e. when you do write(f, "a"); write(f, "b"); and
these are overwrites, it may happen that "b" is written while "a" is not.
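To make the overwrite example concrete, a small sketch (it assumes
"ordering-demo" already exists and is at least two blocks long - the name
and offsets are just examples). If the application needs the order, it has
to insert an fdatasync() between the two writes itself:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("ordering-demo", O_RDWR);

	if (fd < 0)
		return 1;
	/* Two overwrites of already-allocated blocks: after a crash "b"
	 * may be on disk while "a" is not. */
	if (pwrite(fd, "a", 1, 0) != 1)
		return 1;
	if (pwrite(fd, "b", 1, 4096) != 1)
		return 1;
	/* To force the order, the application has to do it itself:
	 *   pwrite(fd, "a", 1, 0);
	 *   fdatasync(fd);
	 *   pwrite(fd, "b", 1, 4096);
	 */
	close(fd);
	return 0;
}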
> data=journalled:
> Data and metadata are both journalled, meaning that a given data
> write will either complete or it will never occur, although the
> precise ordering is not guaranteed. This also implies all of the
> data<=>metadata guarantees of data=ordered.
> + Direct IO data writes are effectively "atomic", resulting in
> less likelihood of data loss for application databases which do
> not do their own journalling. This means that a power failure
> or system crash will not result in a partially-complete write.
Well, direct IO is atomic in data=journal the same way as in data=ordered.
It can happen that only half of a direct IO write is done when you hit the
power button at just the right moment - note this holds for overwrites.
Extending writes or writes to holes are all-or-nothing for ext4 (again both
in data=journal and data=ordered mode).
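A sketch of the direct IO overwrite case (it assumes "dio-demo" already
exists with its first 16 KiB allocated - the name and sizes are just
examples):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("dio-demo", O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, 16384))
		return 1;
	memset(buf, 'x', 16384);
	/* Overwrite of four already-allocated 4 KiB blocks: a crash in the
	 * middle can leave some blocks new and some old, in data=journal as
	 * well as in data=ordered. An extending write or a write into a
	 * hole is all-or-nothing, because the new block mapping only
	 * becomes visible when the transaction commits. */
	if (pwrite(fd, buf, 16384, 0) != 16384)
		return 1;
	free(buf);
	close(fd);
	return 0;
}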
> - Cached writes are not atomic
> + For small cached file writes (of only a few filesystem pages)
> there is a good chance that kernel writeback will queue the
> entire write as a single I/O and it will be "protected" as a
> result. This helps reduce the chance of serious damage to some
> text-based database files (such as those for some Wikis), but
> is obviously not a guarantee.
Page-sized and page-aligned writes are atomic (in both data=journal and
data=ordered modes). When a write spans multiple pages, there is a good
chance the writes will be merged into a single transaction, but no
guarantee, as you correctly note.
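In code (a sketch assuming a 4 KiB page size and that "page-demo" already
exists with the written range allocated - again, the name is made up):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char page[4096], big[8 * 4096];
	int fd = open("page-demo", O_RDWR);

	if (fd < 0)
		return 1;
	memset(page, 'A', sizeof(page));
	memset(big, 'B', sizeof(big));
	/* Page-sized & page-aligned overwrite: after a crash you see either
	 * the whole old page or the whole new one. */
	if (pwrite(fd, page, sizeof(page), 0) != (ssize_t)sizeof(page))
		return 1;
	/* Spans eight pages: often journalled in one transaction, but a
	 * crash may still leave a mix of old and new pages. */
	if (pwrite(fd, big, sizeof(big), 4096) != (ssize_t)sizeof(big))
		return 1;
	close(fd);
	return 0;
}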
> - This writes all data to the block device twice (once to the FS
> journal and once to the data blocks). This may be especially bad
> for write-limited Flash-backed devices.
Correct.
To sum up, the only additional guarantee data=journal offers over
data=ordered is a total ordering of all IO operations. That is, if you do a
sequence of data and metadata operations, then you are guaranteed that
after a crash you will see the filesystem in a state corresponding exactly
to your sequence terminated at some (arbitrary) point. Data writes are
decomposed into a sequence of page-sized & page-aligned writes for the
purposes of this model...
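So, for example, in the following made-up sequence (it assumes "status" and
"db.new" already exist) data=journal guarantees you will never see the
rename on disk without the preceding overwrite of the status record -
something data=ordered does not promise for overwrites:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char rec[16] = "version=42";
	int fd = open("status", O_WRONLY);

	if (fd < 0)
		return 1;
	/* 1. data write: overwrite a fixed-size status record in place */
	if (pwrite(fd, rec, sizeof(rec), 0) != (ssize_t)sizeof(rec))
		return 1;
	/* 2. metadata operation: publish the new database */
	if (rename("db.new", "db"))
		return 1;
	close(fd);
	return 0;
}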
Honza
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR