linux-ext4 - Re: ext4 out of order when use cfq scheduler

Message-ID: <20160315144633.GA12352@quack.suse.cz>
Date:	Tue, 15 Mar 2016 15:46:33 +0100
From:	Jan Kara <jack@...e.cz>
To:	Theodore Ts'o <tytso@....edu>
Cc:	Jan Kara <jack@...e.cz>,
	"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@...bosch.com>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"Li, Michael" <huayil@....qualcomm.com>
Subject: Re: ext4 out of order when use cfq scheduler

On Tue 15-03-16 11:46:34, Jan Kara wrote:
> On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > > anything to IOs which have not reported completion before
> > > > blkdev_issue_flush() was called. Specifically, CFQ will queue a submitted
> > > > bio in its internal RB tree; a following flush request completely bypasses
> > > > this tree and goes directly to the disk, where it flushes caches. Only
> > > > later does CFQ decide to schedule the async writeback from the flusher
> > > > thread, which is still queued in the RB tree...
> > 
> > > Oh, right.  I was forgetting about the flushing machinery rewrite.
> > Thanks for pointing that out.
> > 
> > > But what we *could* do is to swap those two calls and then, in the case
> > > where delalloc is enabled, maintain a list of inodes where we
> > only need to call filemap_fdatawait(), and not initiate writeback for
> > any dirty pages which had been caused by non-allocating writes.
> 
> We actually don't need to swap those two calls - the page is already marked
> as under writeback in
> 
>   mpage_map_and_submit_buffers() -> mpage_submit_page() -> ext4_bio_write_page()
> 
> which gets called while we still hold the transaction handle. I agree
> calling filemap_fdatawait() from JBD2 during commit should be enough to fix
> issues with delalloc writeback. I'm just somewhat afraid that it will be
> more fragile: if we add the inode to the transaction's list in
> ext4_map_blocks(), we can be pretty sure there's no way to allocate a block
> to an inode and thereby introduce data exposure issues (which are then very
> hard to spot). If we depend on callers of ext4_map_blocks() to properly add
> the inode to the appropriate transaction list, we have many more places to
> check. I'll think about whether we could make this more robust.
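
For readers following the flush discussion quoted above, here is a toy
model of why a cache flush alone does not cover a write that is still
sitting in the scheduler's internal queue. It is only a sketch: ToyDisk,
ToyScheduler and both helper functions are made-up names for
illustration, not kernel interfaces, and the real CFQ and block layer
logic is far more involved.

```python
from collections import deque


class ToyDisk:
    """Hypothetical disk with a volatile write cache and stable storage."""

    def __init__(self):
        self.cache = []   # written to the device, not yet durable
        self.stable = []  # durable after a cache flush

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        # Move everything currently in the volatile cache to stable storage.
        self.stable.extend(self.cache)
        self.cache.clear()


class ToyScheduler:
    """Hypothetical stand-in for an I/O scheduler with an internal queue."""

    def __init__(self, disk):
        self.disk = disk
        self.queued = deque()  # submitted, but not yet dispatched to the disk

    def submit_write(self, block):
        self.queued.append(block)

    def dispatch_all(self):
        while self.queued:
            self.disk.write(self.queued.popleft())

    def issue_flush(self):
        # In this model the flush bypasses the internal queue entirely.
        self.disk.flush()


def flush_without_waiting():
    disk = ToyDisk()
    sched = ToyScheduler(disk)
    sched.submit_write("data")
    sched.issue_flush()   # "data" is still in the scheduler queue
    sched.dispatch_all()  # it reaches the disk cache only after the flush
    return disk.stable, disk.cache


def wait_then_flush():
    disk = ToyDisk()
    sched = ToyScheduler(disk)
    sched.submit_write("data")
    sched.dispatch_all()  # stands in for waiting until the write completes
    sched.issue_flush()
    return disk.stable, disk.cache
```

In this model flush_without_waiting() leaves "data" in the volatile
cache, while wait_then_flush() makes it durable, which is the ordering
that waiting for writeback before the flush is meant to provide.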

OK, I have something - Huang, can you check whether the attached patches
also fix your data exposure issues please? The first patch is the original
fix, patch two is a cleanup, patches 3 and 4 implement the speedup
suggested by Ted. Patches are only lightly tested so far.  I'll run more
comprehensive tests later and in particular I want to check whether the
additional complexity actually brings us some advantage at least for
workloads which redirty pages in addition to writing some new ones using
delayed allocation.
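
To make the idea behind the speedup easier to picture, here is a rough
toy model of a commit that starts writeback only for some inodes and
merely waits for others. It is a sketch under assumptions: the flag
names, ToyTransaction and its methods are hypothetical and do not
reproduce the attached patches.

```python
WRITE_DATA = 1  # hypothetical flag: commit must start writeback for the inode
WAIT_DATA = 2   # hypothetical flag: commit must wait for writeback to finish


class ToyTransaction:
    """Hypothetical model of a transaction tracking per-inode data flags."""

    def __init__(self):
        self.inodes = {}  # inode number -> flag bits

    def add_write(self, ino):
        # Commit has to both submit and wait for this inode's data.
        self.inodes[ino] = self.inodes.get(ino, 0) | WRITE_DATA | WAIT_DATA

    def add_wait(self, ino):
        # Writeback was already submitted elsewhere (e.g. the delalloc
        # path), so commit only needs to wait for it.
        self.inodes[ino] = self.inodes.get(ino, 0) | WAIT_DATA

    def commit(self):
        """Return the ordered list of actions a commit would perform."""
        log = []
        for ino, flags in self.inodes.items():
            if flags & WRITE_DATA:
                log.append(("write", ino))
        for ino, flags in self.inodes.items():
            if flags & WAIT_DATA:
                log.append(("wait", ino))
        log.append(("commit", None))
        return log
```

The point of the split is that an inode added with add_wait() never
triggers extra writeback of merely redirtied pages at commit time; the
commit still waits for its already-submitted pages before the commit
record is written.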

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

View attachment "0001-ext4-Fix-data-exposure-after-a-crash.patch" of type "text/x-patch" (3300 bytes)

View attachment "0002-ext4-Remove-EXT4_STATE_ORDERED_MODE.patch" of type "text/x-patch" (1813 bytes)

View attachment "0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch" of type "text/x-patch" (7859 bytes)

View attachment "0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch" of type "text/x-patch" (5007 bytes)