Message-ID: <20110628093652.GA29978@quack.suse.cz>
Date: Tue, 28 Jun 2011 11:36:52 +0200
From: Jan Kara <jack@...e.cz>
To: "Moffett, Kyle D" <Kyle.D.Moffett@...ing.com>
Cc: Ted Ts'o <tytso@....edu>, Lukas Czerner <lczerner@...hat.com>,
Jan Kara <jack@...e.cz>, Sean Ryle <seanbo@...il.com>,
"615998@...s.debian.org" <615998@...s.debian.org>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
Sachin Sant <sachinp@...ibm.com>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
Subject: Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel
BUG at fs/jbd2/commit.c:534" from Postfix on ext4
On Mon 27-06-11 23:21:17, Moffett, Kyle D wrote:
> On Jun 27, 2011, at 12:01, Ted Ts'o wrote:
> > On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
> >>> I've found some. So although data=journal users are a minority, there are
> >>> some. That being said, I agree with you that we should do something about
> >>> it - either state that we want to fully support data=journal - and then we
> >>> should really do better with testing it - or deprecate it and remove it
> >>> (which would save us some complications in the code).
> >>>
> >>> I would be slightly in favor of removing it (code simplicity, fewer options
> >>> to configure for the admin, fewer options to test for us; some users I've
> >>> come across actually were not quite sure why they were using it - they just
> >>> thought it looked safer).
> >
> > Hmm... FYI, I hope to be able to bring online automated testing for
> > ext4 later this summer (there's a testing person at Google who has
> > signed up to work on setting this up as his 20% project). The test
> > matrix that I gave him includes data=journal, so we will be getting
> > better testing in the near future.
> >
> > At least historically, data=journalling was the *simpler* case, and
> > was the first thing supported by ext4. (data=ordered required revoke
> > handling which didn't land for six months or so.) So I'm not really
> > convinced that removing it buys us that much code simplification.
> >
> > That being said, it is true that data=journalled isn't necessarily
> > faster. For heavy disk-bound workloads, it can be slower. So I can
> > imagine adding some documentation that warns people not to use
> > data=journal unless they really know what they are doing, but at least
> > personally, I'm a bit reluctant to dispense with a bug report like
> > this by saying, "oh, that feature should be deprecated".
>
> I suppose I should chime in here, since I'm the one who (potentially
> incorrectly) thinks I should be using data=journalled mode.
>
> My basic impression is that the use of "data=journalled" can help
> reduce the risk (slightly) of serious corruption to some kinds of
> databases when the application does not provide appropriate syncs
> or journalling on its own (e.g. text-based Wiki database files).
It depends on the way such programs update the database files. But
generally yes, data=journal provides somewhat stronger guarantees than the
other journaling modes - see below.
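The usual way for an application not to depend on the data= mode at all is
the write-temp-file, fsync(), rename() pattern. A minimal sketch in C (the
file names are made up, and the fsync of the containing directory is left
out for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace 'path' with 'len' bytes of 'buf' so that after a crash we see
 * either the complete old file or the complete new one. */
static int safe_replace(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int fd;

	snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd)) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);
	return rename(tmp, path);	/* data is on disk before the new name is */
}

int main(void)
{
	const char *page = "== example wiki page ==\n";

	return safe_replace("page.db", page, strlen(page)) ? 1 : 0;
}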
> Please correct me if this is horribly horribly wrong:
>
> no journal:
> Nothing is journalled
> + Very fast.
> + Works well for filesystems that are "mkfs"ed on every boot
> - Have to fsck after every reboot
Fsck is needed only after a crash / hard powerdown. Otherwise completely
correct. Plus you always have the possibility of exposing uninitialized
(potentially sensitive) data after an fsck.
Actually, a normal desktop might be quite happy with a non-journaled
filesystem when fsck is fast enough.
> data=writeback:
> Metadata is journalled, data (to allocated extents) may be written
> before or after the metadata is updated with a new file size.
> + Fast (not as fast as unjournalled)
> + No need to "fsck" after a hard power-down
> - A crash or power failure in the middle of a write could leave
> old data on disk at the end of a file. If security labeling
> such as SELinux is enabled, this could "contaminate" a file with
> data from a deleted file that was at a higher sensitivity.
> Log files (including binary database replication logs) may be
> effectively corrupted as a result.
Correct.
> data=ordered:
> Data appended to a file will be written before the metadata
> extending the length of the file is written, and in certain cases
> the data will be written before file renames (partial ordering),
> but the data itself is unjournalled, and may be only partially
> complete for updates.
> + Does not write data to the media twice
> + A crash or power failure will not leave old uninitialized data
> in files.
> - Data writes to files may only partially complete in the event
> of a crash. No problems for logfiles, or self-journalled
> application databases, but others may experience partial writes
> in the event of a crash and need recovery.
Correct. One should also note that no one guarantees the order in which
data hits the disk - i.e. when you do write(f, "a"); write(f, "b"); and
these are overwrites, it may happen that "b" is written while "a" is not.
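To make the overwrite example concrete, a small sketch (it assumes
"ordering-demo" already exists and is at least two blocks long - the name
and offsets are just examples). If the application needs the order, it has
to insert an fdatasync() between the two writes itself:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("ordering-demo", O_RDWR);

	if (fd < 0)
		return 1;
	/* Two overwrites of already-allocated blocks: after a crash "b"
	 * may be on disk while "a" is not. */
	if (pwrite(fd, "a", 1, 0) != 1)
		return 1;
	if (pwrite(fd, "b", 1, 4096) != 1)
		return 1;
	/* To force the order, the application has to do it itself:
	 *   pwrite(fd, "a", 1, 0);
	 *   fdatasync(fd);
	 *   pwrite(fd, "b", 1, 4096);
	 */
	close(fd);
	return 0;
}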
> data=journalled:
> Data and metadata are both journalled, meaning that a given data
> write will either complete or it will never occur, although the
> precise ordering is not guaranteed. This also implies all of the
> data<=>metadata guarantees of data=ordered.
> + Direct IO data writes are effectively "atomic", resulting in
> less likelihood of data loss for application databases which do
> not do their own journalling. This means that a power failure
> or system crash will not result in a partially-complete write.
Well, direct IO is atomic in data=journal the same way as in data=ordered.
It can happen that only half of a direct IO write is done when you hit the
power button at just the right moment - note this holds for overwrites.
Extending writes or writes to holes are all-or-nothing for ext4 (again both
in data=journal and data=ordered mode).
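A sketch of the direct IO overwrite case (it assumes "dio-demo" already
exists with its first 16 KiB allocated - the name and sizes are just
examples):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("dio-demo", O_WRONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, 16384))
		return 1;
	memset(buf, 'x', 16384);
	/* Overwrite of four already-allocated 4 KiB blocks: a crash in the
	 * middle can leave some blocks new and some old, in data=journal as
	 * well as in data=ordered. An extending write or a write into a
	 * hole is all-or-nothing, because the new block mapping only
	 * becomes visible when the transaction commits. */
	if (pwrite(fd, buf, 16384, 0) != 16384)
		return 1;
	free(buf);
	close(fd);
	return 0;
}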
> - Cached writes are not atomic
> + For small cached file writes (of only a few filesystem pages)
> there is a good chance that kernel writeback will queue the
> entire write as a single I/O and it will be "protected" as a
> result. This helps reduce the chance of serious damage to some
> text-based database files (such as those for some Wikis), but
> is obviously not a guarantee.
Page-sized and page-aligned writes are atomic (in both data=journal and
data=ordered modes). When a write spans multiple pages, there is a good
chance the writes will be merged into a single transaction, but no
guarantee, as you correctly note.
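In code (a sketch assuming a 4 KiB page size and that "page-demo" already
exists with the written range allocated - again, the name is made up):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char page[4096], big[8 * 4096];
	int fd = open("page-demo", O_RDWR);

	if (fd < 0)
		return 1;
	memset(page, 'A', sizeof(page));
	memset(big, 'B', sizeof(big));
	/* Page-sized & page-aligned overwrite: after a crash you see either
	 * the whole old page or the whole new one. */
	if (pwrite(fd, page, sizeof(page), 0) != (ssize_t)sizeof(page))
		return 1;
	/* Spans eight pages: often journalled in one transaction, but a
	 * crash may still leave a mix of old and new pages. */
	if (pwrite(fd, big, sizeof(big), 4096) != (ssize_t)sizeof(big))
		return 1;
	close(fd);
	return 0;
}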
> - This writes all data to the block device twice (once to the FS
> journal and once to the data blocks). This may be especially bad
> for write-limited Flash-backed devices.
Correct.
To sum up, the only additional guarantee data=journal offers over
data=ordered is a total ordering of all IO operations. That is, if you do a
sequence of data and metadata operations, then you are guaranteed that
after a crash you will see the filesystem in a state corresponding exactly
to your sequence terminated at some (arbitrary) point. Data writes are
decomposed into a sequence of page-sized & page-aligned writes for the
purposes of this model...
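So, for example, in the following made-up sequence (it assumes "status" and
"db.new" already exist) data=journal guarantees you will never see the
rename on disk without the preceding overwrite of the status record -
something data=ordered does not promise for overwrites:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char rec[16] = "version=42";
	int fd = open("status", O_WRONLY);

	if (fd < 0)
		return 1;
	/* 1. data write: overwrite a fixed-size status record in place */
	if (pwrite(fd, rec, sizeof(rec), 0) != (ssize_t)sizeof(rec))
		return 1;
	/* 2. metadata operation: publish the new database */
	if (rename("db.new", "db"))
		return 1;
	close(fd);
	return 0;
}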
Honza
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR