Message-ID: <BANLkTinndb4oHi6kNQrTsL86UrKDB=2rK6-gA8FM1Nj8CKYZ+w@mail.gmail.com>
Date: Fri, 1 Jul 2011 15:55:33 -0700
From: Curt Wohlgemuth <curtw@...gle.com>
To: Jan Kara <jack@...e.cz>
Cc: Christoph Hellwig <hch@...radead.org>,
Al Viro <viro@...iv.linux.org.uk>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
fengguang.wu@...el.com
Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
Hi Jan:
On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@...e.cz> wrote:
> On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
>> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
>> > > For sys_sync I'm pretty sure we could simply remove the
>> > > writeback_inodes_sb call and get just as good if not better performance,
>> > Actually, it won't with current code. Because WB_SYNC_ALL writeback
>> > currently has the peculiarity that it looks like:
>> > for all inodes {
>> >     write all inode data
>> >     wait for inode data
>> > }
>> > while to achieve good performance we actually need something like
>> > for all inodes
>> >     write all inode data
>> > for all inodes
>> >     wait for inode data
>> > It can make an order-of-magnitude difference when there are lots of
>> > smallish files - SLES had a bug like this, so I know from user reports ;)
>>
>> I don't think that's true. The WB_SYNC_ALL writeback is done using
>> sync_inodes_sb, which operates as:
>>
>> for all dirty inodes in bdi:
>>     if inode belongs to sb
>>         write all inode data
>>
>> for all inodes in sb:
>>     wait for inode data
>>
>> We still do that in a big for-each-sb loop, though.
> True but writeback_single_inode() has in it:
>         if (wbc->sync_mode == WB_SYNC_ALL) {
>                 int err = filemap_fdatawait(mapping);
>                 if (ret == 0)
>                         ret = err;
>         }
> So we end up waiting much earlier. We should probably remove this wait,
> but that will need some auditing, I guess.
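To make the ordering concrete, here's a rough userspace analogue of the
two patterns. The file list is made up and error handling is omitted;
sync_file_range() is the Linux-only call that lets you separate "start
writeback" from "wait for writeback":

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define NFILES 3

    /* Hypothetical paths, for illustration only. */
    static const char *paths[NFILES] = { "/tmp/a", "/tmp/b", "/tmp/c" };

    int main(void)
    {
            int fds[NFILES];
            int i;

            for (i = 0; i < NFILES; i++)
                    fds[i] = open(paths[i], O_RDWR);

            /* Slow pattern: write and wait one inode at a time; each
             * fdatasync() stalls until that file's I/O completes. */
            for (i = 0; i < NFILES; i++)
                    fdatasync(fds[i]);

            /* Fast pattern: kick off writeback on every file first... */
            for (i = 0; i < NFILES; i++)
                    sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

            /* ...then wait for all of it, so the I/O can overlap. */
            for (i = 0; i < NFILES; i++)
                    sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WAIT_AFTER);

            for (i = 0; i < NFILES; i++)
                    close(fds[i]);
            return 0;
    }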
So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
 - queue a work item; this will
     - write all dirty inodes in the sb; for each inode, that means
         - write that inode's pages
         - wait on that inode's pages
 - wait for the work item
 - wait on all inodes in the sb (wait_sb_inodes())
I guess the point of wait_sb_inodes() is to wait on all inodes that
were written in a previous writeback pass.
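For reference, sync_inodes_sb() is structured roughly like this (I'm
paraphrasing from memory of fs/fs-writeback.c, so treat the field names
as approximate):

    void sync_inodes_sb(struct super_block *sb)
    {
            DECLARE_COMPLETION_ONSTACK(done);
            struct wb_writeback_work work = {
                    .sb        = sb,
                    .sync_mode = WB_SYNC_ALL,
                    .nr_pages  = LONG_MAX,
                    .done      = &done,
            };

            /* Hand the sb to the per-bdi flusher thread... */
            bdi_queue_work(sb->s_bdi, &work);
            /* ...block until it has pushed out all dirty inodes... */
            wait_for_completion(&done);
            /* ...then wait again on every inode under writeback. */
            wait_sb_inodes(sb);
    }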
One other issue I have with sync as it's structured is that we don't
do a WB_SYNC_ALL pass on any inode that's only associated with a block
device, and not with a mounted filesystem. Blockdev mounts are
pseudo-mounts, and are explicitly skipped in __sync_filesystem(). So
if you've written directly to a block device and then do a sync, the
only pass over that inode's pages is via wakeup_flusher_threads() --
which operates on a BDI, regardless of the superblock, and uses
WB_SYNC_NONE.
All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
exclude pseudo-superblocks.
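Until that changes, the only reliable approach I know of from userspace
is to fsync() the block device directly: fsync() on the bdev inode does
a write-and-wait pass over its pages, which sync(2) currently doesn't
guarantee. A minimal sketch (the device path is just an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Substitute whatever device you actually wrote to. */
            int fd = open("/dev/sda", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (fsync(fd) < 0) {    /* write and wait on the bdev inode */
                    perror("fsync");
                    return 1;
            }
            close(fd);
            return 0;
    }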
I've seen cases in our modified kernels here at Google in which
lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
/dev/sda (though I haven't been able to come up with a consistent test
case, nor reproduce this on an upstream kernel).
Thanks,
Curt
>
>> > You mean that sync(1) would actually write the data itself? It would
>> > certainly make some things simpler, but it has its problems as well - for
>> > example, sync racing with the flusher thread writing back inodes can
>> > create a rather bad IO pattern...
>>
>> Only the second pass. The idea is that we first try to use the flusher
>> threads for good I/O patterns, but if we can't get that to work, only
>> block the caller and not everyone. But that's just an idea so far;
>> it would need serious benchmarking. And despite what I claimed before,
>> we actually do the wait in the caller context already anyway, which
>> gives you the easy part of the above effect.
> Modulo the writeback_single_inode() wait. But if that is dealt with I
> agree.
>
> Honza
> --
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
>