linux-ext4 - Re: ext4 out of order when use cfq scheduler

Message-ID: <20160315200951.GA1445@quack.suse.cz>
Date:	Tue, 15 Mar 2016 21:09:51 +0100
From:	Jan Kara <jack@...e.cz>
To:	Theodore Ts'o <tytso@....edu>
Cc:	Jan Kara <jack@...e.cz>,
	"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@...bosch.com>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"Li, Michael" <huayil@....qualcomm.com>
Subject: Re: ext4 out of order when use cfq scheduler

On Tue 15-03-16 15:46:33, Jan Kara wrote:
> On Tue 15-03-16 11:46:34, Jan Kara wrote:
> > On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > > > > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > > > > anything to IOs which have not reported completion before
> > > > > blkdev_issue_flush() was called. Specifically, CFQ will queue a submitted
> > > > > bio in its internal RB tree; a following flush request completely bypasses
> > > > > this tree and goes directly to the disk, where it flushes caches. Only later
> > > > > does CFQ decide to dispatch the async writeback from the flusher thread
> > > > > that is queued in the RB tree...
> > > 
> > > > Oh, right.  I am forgetting about the flushing machinery rewrite.
> > > Thanks for pointing that out.
> > > 
> > > But what we *could* do is to swap those two calls and then in the case
> > > where delalloc is enabled, could maintain a list of inodes where we
> > > only need to call filemap_fdatawait(), and not initiate writeback for
> > > any dirty pages which had been caused by non-allocating writes.
> > 
> > > We actually don't need to swap those two calls - the page is already
> > > marked as under writeback in
> > 
> > >   mpage_map_and_submit_buffers() -> mpage_submit_page() -> ext4_bio_write_page()
> > 
> > which gets called while we still hold the transaction handle. I agree
> > calling filemap_fdatawait() from JBD2 during commit should be enough to fix
> > issues with delalloc writeback. I'm just somewhat afraid that it will be
> > > more fragile: If we add the inode to the transaction's list in
> > > ext4_map_blocks(), we are pretty sure there's no way to allocate a block to
> > > an inode without also doing so, and thus no way to introduce data exposure
> > > issues (which are then very hard to spot). If we depend on callers of
> > > ext4_map_blocks() to properly add the inode to the appropriate transaction
> > > list, we have many more places to check. I'll think about whether we could
> > > make this more robust.
> 
> OK, I have something - Huang, can you check whether the attached patches
> also fix your data exposure issues please? The first patch is the original
> fix, patch two is a cleanup, patches 3 and 4 implement the speedup
> suggested by Ted. Patches are only lightly tested so far.  I'll run more
> comprehensive tests later and in particular I want to check whether the
> additional complexity actually brings us some advantage at least for
> workloads which redirty pages in addition to writing some new ones using
> delayed allocation.

OK, there was a bug in patch 3. Attached is a new version of patches 3 and
4.
							Honza

View attachment "0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch" of type "text/x-patch" (7472 bytes)

View attachment "0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch" of type "text/x-patch" (4273 bytes)
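
The ordering problem described in the quoted text above (a flush request bypassing writes that still sit in the scheduler's internal queue) can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not kernel code, and the names ToyDisk, ToyScheduler and run are hypothetical and do not correspond to any CFQ, JBD2 or ext4 interface. The "commit" write is sent straight to the disk purely to keep the model small.

```python
from collections import deque


class ToyDisk:
    """Toy disk with a volatile write cache and stable storage (hypothetical)."""

    def __init__(self):
        self.cache = []   # writes that reached the device but are not yet durable
        self.stable = []  # writes that would survive a power cut

    def write(self, tag):
        self.cache.append(tag)

    def flush(self):
        # Make everything currently in the cache durable.
        self.stable.extend(self.cache)
        self.cache.clear()


class ToyScheduler:
    """Toy I/O scheduler: writes wait in an internal queue, flushes bypass it."""

    def __init__(self, disk):
        self.disk = disk
        self.queue = deque()  # stands in for the scheduler's internal tree

    def submit_write(self, tag):
        self.queue.append(tag)  # queued, not yet dispatched to the disk

    def issue_flush(self):
        self.disk.flush()  # goes straight to the disk, ignoring self.queue

    def dispatch_all(self):
        while self.queue:
            self.disk.write(self.queue.popleft())


def run(wait_for_data_first):
    """Return what is durable after a simulated commit."""
    disk = ToyDisk()
    sched = ToyScheduler(disk)
    sched.submit_write("data")
    if wait_for_data_first:
        # Stands in for waiting until the data write has completed.
        sched.dispatch_all()
    sched.issue_flush()
    # Simplification: the commit record is written and made durable directly.
    disk.write("commit")
    disk.flush()
    return list(disk.stable)


if __name__ == "__main__":
    print(run(wait_for_data_first=False))
    print(run(wait_for_data_first=True))
```

In this model, flushing without first waiting for the data write leaves the commit durable while the data is still sitting in the queue, which is the shape of the exposure discussed in the thread; waiting for the data before the flush makes both durable.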