[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080116223510.GY155407@sgi.com>
Date: Thu, 17 Jan 2008 09:35:10 +1100
From: David Chinner <dgc@....com>
To: Fengguang Wu <wfg@...l.ustc.edu.cn>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Michael Rubin <mrubin@...gle.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure
On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > Then to do better ordering by adopting radix tree(or rbtree
> > > if radix tree is not enough),
> >
> > ordering of what?
>
> Switch from time to location.
Note that data writeback may be adversely affected by location
based writeback rather than time based writeback - think of
the effect of location based data writeback on an app that
creates lots of short term (<30s) temp files and then removes
them before they are written back.
Also, data writeback locatio cannot be easily derived from
the inode number in pretty much all cases. "near" in terms
of XFS means the same AG which means the data could be up to
a TB away from the inode, and if you have >1TB filesystems
usingthe default inode32 allocator, file data is *never*
placed near the inode - the inodes are in the first TB of
the filesystem, the data is rotored around the rest of the
filesystem.
And with delayed allocation, you don't know where the data is even
going to be written ahead of the filesystem ->writepage call, so you
can't do optimal location ordering for data in this case.
> > > and lastly get rid of the list_heads to
> > > avoid locking. Does it sound like a good path?
> >
> > I'd have thaought that replacing list_heads with another data structure
> > would be a simgle commit.
>
> That would be easy. s_more_io and s_more_io_wait can all be converted
> to radix trees.
Makes sense for location based writeback of the inodes themselves,
but not for data.
Hmmmm - I'm wondering if we'd do better to split data writeback from
inode writeback. i.e. we do two passes. The first pass writes all
the data back in time order, the second pass writes all the inodes
back in location order.
Right now we interleave data and inode writeback, (i.e. we do data,
inode, data, inode, data, inode, ....). I'd much prefer to see all
data written out first, then the inodes. ->writepage often dirties
the inode and hence if we need to do multiple do_writepages() calls
on an inode to flush all the data (e.g. congestion, large amounts of
data to be written, etc), we really shouldn't be calling
write_inode() after every do_writepages() call. The inode
should not be written until all the data is written....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists