[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20070522112120.4a5c6a5d.akpm@linux-foundation.org>
Date: Tue, 22 May 2007 11:21:20 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Chris Mason <chris.mason@...cle.com>
Cc: Chuck Ebbert <cebbert@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: filesystem benchmarking fun
On Tue, 22 May 2007 12:35:11 -0400
Chris Mason <chris.mason@...cle.com> wrote:
> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > On Wed, 16 May 2007 16:14:14 -0400
> > Chris Mason <chris.mason@...cle.com> wrote:
> >
> > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > The good news is that if you let it run long enough, the times
> > > > > stabilize. The bad news is:
> > > > >
> > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > > >
> > > > well hang on. Doesn't this just mean that the first few runs were writing
> > > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > > >
> > > > Or do you have a sync in there?
> > > >
> > > There's no sync, but if you watch vmstat you can clearly see the log
> > > flushes, even when the overall create times are 11MB/s. vmstat goes
> > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
> >
> > How do you know that it is a log flush rather than, say, pdflush
> > hitting the blockdev inode and doing a big seeky write?
>
> Ok, I did some more work to split out the two cases (block device inode
> writeback and log flushing).
>
> I patched jbd's log_do_checkpoint to put all the blocks it wanted to
> write in a radix tree, then send them all down in order at the end.
Side note: we already have all of that capability in the kernel:
sync_inode(blockdev_inode, wbc) will do an ascending-LBA write of the whole
blockdev.
It could be that as a quick diddle, running sync_inode() in
do-block-on-queue-congestion mode prior to doing the checkpoint would have
some benefit.
> The elevator should be helping here, but jbd is sending down 2,000
> to 3,000 blocks during the checkpoint and upping nr_requests alone
> didn't seem to be doing the trick.
>
> Unpatched ext3 would break down into seeks after 8 kernel trees are
> created (222MB each). With the radix sorting, the first 15 kernel trees
> are created quickly, and then we slow down.
>
> So I waited until around the 25th kernel tree was created, hit ctrl-c
> and ran sync. vmstat showed writes going at 2MB/s, and sysrq-w showed
> sync was running the block device inode for most of the 2MB/s period.
>
> It looks as though the dirty pages on the block device inode are spread
> out far enough that we're not getting good streaming writes. Mark
> Fasheh ran on a bigger raid array, where performance was consistently
> good for the whole run. I'm assuming the larger write cache on the
> array was able to group the data writes with the metadata on disk, while
> my poor little sata drive wasn't. Dave Chinner hinted that xfs is
> probably suffering a similar problem, which is usually fixed by backing
> the FS with stripes and big raid.
>
> My vaporware FS is able to maintain speed through the run because the
> allocator tries to keep data and metadata grouped into 256mb chunks,
> and so they don't end up mingling on disk until things get full.
>
> At any rate, it may be worth putzing with the writeback routines to try
> and find dirty pages close by in the block dev inode when doing data
> writeback. My guess is that ext3 should be going 1.5x to 2x faster for
> this particular run, but that's a huge amount of complexity added so I'm
> not convinced it is a great idea.
Yes, this is a distinct disadvantage of the whole per-address-space
writeback scheme - we're leaving IO scheduling optimisations on the floor,
especially wrt the blockdev inode, but probably also wrt regular-file
versus regular-file. Even if one makes the request queue tremendously
huge, that won't help if there's dirty data close-by the disk head which
hasn't even been put into the queue yet.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists