linux-ext4 - Re: Selective Data Journaling in ext4

Message-ID: <20190213185334.GY23000@mit.edu>
Date:   Wed, 13 Feb 2019 13:53:34 -0500
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Vijay Chidambaram <vijayc@...xas.edu>
CC:     Andreas Dilger <adilger@...ger.ca>, <linux-ext4@...r.kernel.org>,
        <jesus.palos@...xas.edu>
Subject: Re: Selective Data Journaling in ext4

On Wed, Feb 13, 2019 at 10:30:47AM -0600, Vijay Chidambaram wrote:
> Agreed, but another way to view this feature is that it is dynamic
> switching between ordered mode and data journaling mode. We switch to
> data journaling mode exactly when it is required, so you are right
> that most applications would never see a difference. But when it is
> required, this scheme would ensure stronger semantics are provided.
> Overall, it provides data-journaling guarantees all the time, and I
> was thinking some applications would like that peace of mind.

Switching back and forth between ordered and data journalling mode is a
bit tricky.  (Insert "one does not simply walk into Mordor" meme here.)

See the comment in ext4_change_journal_flag() in fs/ext4/inode.c:

	/*
	 * We have to be very careful here: changing a data block's
	 * journaling status dynamically is dangerous.  If we write a
	 * data block to the journal, change the status and then delete
	 * that block, we risk forgetting to revoke the old log record
	 * from the journal and so a subsequent replay can corrupt data.
	 * So, first we make sure that the journal is empty and that
	 * nobody is changing anything.
	 */

What this means is that you have to track a list of blocks that have
ever been data journalled, because before we delete the file, we have
to write revoke records for all blocks belonging to that file that are
on the list.  Similarly, if you switch from ordered to data journalling
mode, all of those blocks must be revoked.
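
To make the hazard concrete, here is a toy user-space model of journal
replay.  It is only a sketch: the Txn class and replay() function are
hypothetical names, and this is not the jbd2 on-disk format.  It assumes
a revoke record suppresses replay of any copy of that block logged in
the same or an earlier transaction.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    """One committed transaction in the toy journal (hypothetical layout)."""
    tid: int
    blocks: dict   # block number -> data logged in this transaction
    revoked: set   # block numbers revoked by this transaction


def replay(disk, journal):
    """Replay committed transactions, oldest first, onto `disk` (a dict)."""
    # Pass 1: for each block, remember the newest transaction that revokes it.
    last_revoke = {}
    for txn in journal:
        for blk in txn.revoked:
            last_revoke[blk] = txn.tid
    # Pass 2: copy logged blocks unless a same-or-later revoke covers them.
    for txn in journal:
        for blk, data in txn.blocks.items():
            if last_revoke.get(blk, -1) >= txn.tid:
                continue  # stale log record, skip it
            disk[blk] = data
    return disk


# Block 100 was data journalled in transaction 1.  The file was then deleted
# and the block reused, with "new" written in place (not via the journal).
without_revoke = [Txn(1, {100: "old"}, set())]
with_revoke = [Txn(1, {100: "old"}, set()), Txn(2, {}, {100})]

corrupted = replay({100: "new"}, without_revoke)   # stale copy wins
intact = replay({100: "new"}, with_revoke)         # revoke protects the block
```

Without the revoke record, replay after a crash copies the stale
journalled contents over the block's new owner, which is exactly the
corruption the comment warns about.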

This should also be done in a way that avoids serializing parallel
writes to the inode.  That's not something we support today (yet),
but there are some plans to allow parallel direct I/O writes to the
file.  Speaking of Direct I/O writes, as above, if a block was
previously written via data journalling, the revoke block must be
submitted --- and committed --- before Direct I/O writes to that block
can be allowed.
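
A minimal sketch of the bookkeeping this implies, in toy Python rather
than kernel code.  The JournalledBlockTracker class and its method names
are hypothetical; the only rule it encodes is that a block with a live
copy in the journal stays off limits to Direct I/O until its revoke has
been committed, not merely queued.

```python
class JournalledBlockTracker:
    """Toy per-inode record of blocks that have been data journalled."""

    def __init__(self):
        self.journalled = set()       # blocks with a live copy in the journal
        self.pending_revoke = set()   # revoke queued but not yet committed

    def note_journalled_write(self, blk):
        # A fresh journalled copy supersedes any revoke still waiting to commit.
        self.journalled.add(blk)
        self.pending_revoke.discard(blk)

    def queue_revoke(self, blk):
        # Only blocks that were actually journalled need a revoke record.
        if blk in self.journalled:
            self.pending_revoke.add(blk)

    def commit(self):
        # Once the transaction holding the revokes commits, replay can no
        # longer resurrect the old copies, so the blocks are safe again.
        self.journalled -= self.pending_revoke
        self.pending_revoke.clear()

    def direct_io_allowed(self, blk):
        return blk not in self.journalled


tracker = JournalledBlockTracker()
tracker.note_journalled_write(7)
tracker.queue_revoke(7)
allowed_before_commit = tracker.direct_io_allowed(7)   # revoke only queued
tracker.commit()
allowed_after_commit = tracker.direct_io_allowed(7)    # revoke committed
```

A single shared set like this would itself serialize writers, so a real
implementation would need a structure that tolerates concurrent updates.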

> > Since we already have delalloc to pre-stage the dirty pages before the
> > write, we can make a good decision about whether the file data should
> > be written to the journal or directly to the filesystem.

Note that delalloc and data journalling are not compatible.  That being
said, if we are writing to a not-yet-allocated block, the approach from
recent discussions, changing ext4 so that we only insert the block into
the extent tree in a workqueue triggered by the I/O callback for the
data block write, is probably the better way of removing the
data=ordered overhead.
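
The ordering idea can be sketched with a toy model.  Everything here
(the Inode class, submit_data_write(), run_workqueue(), and the global
workqueue) is a hypothetical stand-in, not ext4 code: the data write
goes to a block that no metadata references yet, and the extent mapping
is only added by deferred work queued from the I/O completion callback.

```python
from collections import deque


class Inode:
    """Toy inode: just a logical-to-physical block map."""

    def __init__(self):
        self.extents = {}   # logical block -> physical block


workqueue = deque()   # deferred metadata updates, run outside I/O completion


def submit_data_write(disk, inode, lblk, pblk, data):
    """Start a data write; return the callback the 'device' calls when done."""
    def on_io_complete():
        disk[pblk] = data                       # data is now stable on disk
        workqueue.append((inode, lblk, pblk))   # defer the extent insert
    return on_io_complete


def run_workqueue():
    """Insert mappings for blocks whose data writes have completed."""
    while workqueue:
        inode, lblk, pblk = workqueue.popleft()
        inode.extents[lblk] = pblk


disk, inode = {}, Inode()
complete = submit_data_write(disk, inode, lblk=0, pblk=500, data="payload")
mapped_before_io = 0 in inode.extents    # no metadata points at the block yet
complete()                               # the device finishes the write
mapped_before_work = 0 in inode.extents  # still unmapped until the work runs
run_workqueue()
mapped_after_work = inode.extents.get(0)
```

Because the mapping never exists before the data does, a crash at any
point cannot expose a block with stale contents, which is the guarantee
data=ordered otherwise enforces by flushing data before the commit.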

Finally, this optimization only makes sense for HDDs, right?  For
SSDs, random writes are mostly free, and the cost of the double
write, not to mention the write amplification effect, probably makes
this not worthwhile.

Cheers,

						- Ted