linux-kernel - Re: [PATCH] jbd2: avoid the concurrent data writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20101117093621.29fed027@feng-i7>
Date:	Wed, 17 Nov 2010 09:36:21 +0800
From:	Feng Tang <feng.tang@...el.com>
To:	Jan Kara <jack@...e.cz>
CC:	no To-header on input <""@suse.de>,
	"Wu, Fengguang" <fengguang.wu@...el.com>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] jbd2: avoid the concurrent data writeback

Hi Jan,

On Tue, 16 Nov 2010 20:13:23 +0800
Jan Kara <jack@...e.cz> wrote:

>   Hi,
> 
>   sorry for chiming in a bit late...
> On Mon 15-11-10 17:59:43, Feng Tang wrote:
> > From b16cfc5a560f2549ac69dbb235a550500ea1719f Mon Sep 17 00:00:00
> > 2001 From: Feng Tang <feng.tang@...el.com>
> > Date: Mon, 15 Nov 2010 21:06:44 +0800
> > Subject: [PATCH] jbd2: avoid the concurrent data writeback
> > 
> > When dd a big file to an ext4 partition, it is very likely to happen
> > that both the background flush thread and kjounald try to do data
> > writeback for it, that the flush thread is doing the writeback for
> > this file and jbd2 thread are also waken up to commit the
> > transaction. Because kjounald only calls the generic_writepages()
> > whose path doesn't really allocate disk blocks, the ext4_witepage()
> > may be called lots of times (100000+ for a 1g file dd) without
> > really writing one page back (skipped), which will consume lots of
> > unnecessary CPU time
> > 
> > This could be found by a simple test case with ftrace:
> > $ sync;
> > $ echo 40960 > buffer_size_kb;echo 1 > events/writeback/enable;echo
> > 1 > events/jbd2/enable;echo 1 > events/ext4/enable; $ dd
> > if=/dev/zero of=/home/test/1g.bin bs=1M count=1024;sync; $ cat
> > trace > /home/test/jbd2_ext4_1g_dd.log $ grep -c
> > wcb_writepage /home/test/jbd2_ext4_1g_dd.log
> > 
> > This patch will check if the inode is under data syncing, if yes
> > then don't start the writeback from kjournald
> > 
> > The Perf statics (On my Core Duo 2 + 4G RAM + SATA disk + Ext4 in
> > all default modes): before the patch >  112191
> > writeback:wbc_writepage  #      0.005 M/sec after the patch  >  54
> > writeback:wbc_writepage  #      0.000 M/sec
> > 
> > Signed-off-by: Feng Tang <feng.tang@...el.com>
> > ---
> >  fs/jbd2/commit.c |   11 +++++++++++
> >  1 files changed, 11 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index f3ad159..0f3e356 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -170,6 +170,10 @@ static int
> > journal_wait_on_commit_record(journal_t *journal,
> >   * We don't do block allocation here even for delalloc. We don't
> >   * use writepages() because with dealyed allocation we may be doing
> >   * block allocation in writepages().
> > + *
> > + * Sometimes when this get called, the host inode may be under data
> > + * syncing initiated by flush thread(especially for a large file),
> > and 
> > + * in such situation, we should skip this path of writeback
> >   */
> >  static int journal_submit_inode_data_buffers(struct address_space
> > *mapping) {
> > @@ -181,6 +185,13 @@ static int
> > journal_submit_inode_data_buffers(struct address_space
> > *mapping) .range_end = i_size_read(mapping->host), };
> >  
> > +	spin_lock(&inode_lock);
> > +	if (mapping->host->i_state & I_SYNC) {
> > +		spin_unlock(&inode_lock);
> > +		return 0;
> > +	}
> > +	spin_unlock(&inode_lock);
> > +
>   Sorry, but this is just wrong. Not only because of inode_lock as
> Christoph pointed out but mainly principially. ext4 and ocfs2 in
> data=ordered mode rely on data pages (with underlying blocks already
> allocated) being written out before transaction commit proceeds for
> data integrity. So you cannot just go and remove the writeback saying
> it improves performance.
> 
> I'm not saying that ext4 handling of ordered mode does not need a
> revision (we actually talked with Ted about it at Kernel Summit). But
> the solution for it is to use IO completion callback to do extent
> tree manipulations and stop using JBD2 for data syncing. We already
> do that for direct IO and conversion of preallocated space so doing
> it in all cases should be reasonably easy. Until that happens, you
> can run ext4 in data=writeback mode which will also stop JBD2 from
> doing the writeback (and effectively is rather similar to your patch).

Glad to know that the revision is on the way, and thanks for the 
detailed clarification.

- Feng 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/