linux-kernel - Re: [patch] Converting writeback linked lists to a tree based data structure

Message-Id: <E1JFLEW-0002oE-G1@localhost.localdomain>
Date:	Thu, 17 Jan 2008 11:16:00 +0800
From:	Fengguang Wu <wfg@...l.ustc.edu.cn>
To:	David Chinner <dgc@....com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Michael Rubin <mrubin@...gle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [patch] Converting writeback linked lists to a tree based data structure

On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > Then to do better ordering by adopting a radix tree (or an rbtree
> > > > if radix tree is not enough),
> > > 
> > > ordering of what?
> > 
> > Switch from time to location.
> 
> Note that data writeback may be adversely affected by location-based
> writeback rather than time-based writeback - think of
> the effect of location-based data writeback on an app that
> creates lots of short term (<30s) temp files and then removes
> them before they are written back.

A small (e.g. 5s) time window can still be enforced, but...

> Also, data writeback location cannot be easily derived from
> the inode number in pretty much all cases. "near" in terms
> of XFS means the same AG which means the data could be up to
> a TB away from the inode, and if you have >1TB filesystems
> using the default inode32 allocator, file data is *never*
> placed near the inode - the inodes are in the first TB of
> the filesystem, the data is rotored around the rest of the
> filesystem.
> 
> And with delayed allocation, you don't know where the data is even
> going to be written ahead of the filesystem ->writepage call, so you
> can't do optimal location ordering for data in this case.

Agreed.

> Hmmmm - I'm wondering if we'd do better to split data writeback from
> inode writeback. i.e. we do two passes.  The first pass writes all
> the data back in time order, the second pass writes all the inodes
> back in location order.
> 
> Right now we interleave data and inode writeback (i.e. we do data,
> inode, data, inode, data, inode, ....). I'd much prefer to see all
> data written out first, then the inodes. ->writepage often dirties
> the inode and hence if we need to do multiple do_writepages() calls
> on an inode to flush all the data (e.g. congestion, large amounts of
> data to be written, etc), we really shouldn't be calling
> write_inode() after every do_writepages() call. The inode
> should not be written until all the data is written....

That may be good for XFS. Another case is documented as follows:
"the write_inode() function of a typical fs will perform no I/O, but
will mark buffers in the blockdev mapping as dirty."
