linux-kernel - Re: [RFC] page-writeback: move inodes from one superblock together

Message-ID: <20090925041619.GB9464@discord.disaster>
Date:	Fri, 25 Sep 2009 14:16:19 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	Arjan van de Ven <arjan@...radead.org>,
	Jens Axboe <jens.axboe@...cle.com>,
	"Li, Shaohua" <shaohua.li@...el.com>,
	lkml <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Chris Mason <chris.mason@...cle.com>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	Jan Kara <jack@...e.cz>
Subject: Re: [RFC] page-writeback: move inodes from one superblock together

On Thu, Sep 24, 2009 at 10:09:19PM +0800, Wu Fengguang wrote:
> On Thu, Sep 24, 2009 at 09:52:17PM +0800, Arjan van de Ven wrote:
> > On Thu, 24 Sep 2009 21:46:25 +0800
> > Wu Fengguang <fengguang.wu@...el.com> wrote:
> > > 
> > > Note that dirty_time may not be unique, so some workaround is needed.
> > > And the resulting rbtree implementation may not be more efficient than
> > > several list traversals even for a very large list (as long as
> > > the number of superblocks is low).
> > > 
> > > The good side is, once sb+dirty_time rbtree is implemented, it should
> > > be trivial to switch the key to sb+inode_number (also may not be
> > > unique), and to do location ordered writeback ;)
> > 
> > would you want to sort by dirty time, or by inode number?
> > (assuming inode number is loosely related to location on disk)
> 
> Sort by inode number; dirty time will also be considered when judging
> whether the traversed inode is old enough(*) to be eligible for writeback.

Even if the inode number is directly related to location on disk
(like for XFS), there is no guarantee that the data or related
metadata (indirect blocks) writeback location is in any way related
to the inode number. e.g. when using the 32-bit allocator on XFS
(default for > 1TB filesystems), there is _zero correlation_ between
the inode number and the data location. Hence writeback by inode
number will not improve writeback patterns at all.

Only the filesystem knows what the best writeback pattern really is;
any change is going to affect filesystems differently.

> The more detailed algorithm would be:
> 
> - put inodes into an rbtree with key sb+inode_number
> - in each per-5s writeback, traverse a range of 1/5 of the rbtree
> - in each traversal, sync inodes that were dirtied more than 5s ago
>
> So the user visible result would be
> - on every 5s, roughly a 1/5 disk area will be visited
> - for each dirtied inode, it will be synced after 5-30s
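
The quoted algorithm can be sketched in user space as follows. This is
only an illustration, not kernel code: a sorted Python list stands in
for the rbtree, and DirtyIndex, mark_dirty() and sweep() are
hypothetical names. It also assumes the (sb, inode_number) key is
unique, and it takes "1/5 of the rbtree" to mean one fifth of the
entries rather than one fifth of the key range.

```python
import bisect
from dataclasses import dataclass, field

EXPIRE_SECS = 5   # an inode must have been dirty at least this long
SWEEP_PARTS = 5   # each pass covers about 1/5 of the entries


@dataclass
class DirtyIndex:
    """Sorted (sb, ino) keys standing in for the proposed rbtree."""
    keys: list = field(default_factory=list)     # sorted (sb, ino) tuples
    dirtied: dict = field(default_factory=dict)  # key -> time first dirtied
    cursor: tuple = (-1, -1)                     # last key visited

    def mark_dirty(self, sb, ino, now):
        key = (sb, ino)
        if key not in self.dirtied:
            bisect.insort(self.keys, key)
            self.dirtied[key] = now

    def sweep(self, now):
        """One periodic pass: visit ~1/5 of the keys, sync the old ones."""
        n = len(self.keys)
        if n == 0:
            return []
        span = -(-n // SWEEP_PARTS)  # ceil(n / SWEEP_PARTS)
        # Resume just after the last visited key, wrapping at the end.
        start = bisect.bisect_right(self.keys, self.cursor)
        visited = [self.keys[(start + i) % n] for i in range(span)]
        self.cursor = visited[-1]
        synced = [k for k in visited
                  if now - self.dirtied[k] >= EXPIRE_SECS]
        for k in synced:
            self.keys.remove(k)
            del self.dirtied[k]
        return synced
```

Each call to sweep() returns the keys it would have written back, in
key order, so inodes that are visited before they are 5s old simply
wait for the next full cycle.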

Personally, I'd prefer that writeback calls a vector that says
"writeback inodes older than N", with something like the above
implemented as the generic mechanism. That way filesystems can override
the generic algorithm if there is a better way to track and write
back dirty inodes for that filesystem.
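
A rough user-space sketch of that shape, to make the idea concrete.
Everything named here is hypothetical (SuperBlock, the
writeback_older_than hook, make_location_ordered); it only shows a
per-filesystem hook with a generic fallback, not an actual kernel
interface. The generic fallback orders by inode number, and the
override orders by whatever location the filesystem itself reports.

```python
from typing import Callable, Dict, List, Optional

# Hook signature (assumed): takes the dirty map (ino -> dirtied time)
# and a cutoff time, writes back inodes dirtied at or before the
# cutoff, and returns them in the order they were written.
WritebackOp = Callable[[Dict[int, float], float], List[int]]


def generic_writeback_older_than(dirty: Dict[int, float],
                                 cutoff: float) -> List[int]:
    """Generic mechanism: expired inodes in inode-number order."""
    expired = sorted(ino for ino, t in dirty.items() if t <= cutoff)
    for ino in expired:
        del dirty[ino]
    return expired


def make_location_ordered(location_of: Callable[[int], int]) -> WritebackOp:
    """Build an override that orders by a filesystem-known location."""
    def op(dirty: Dict[int, float], cutoff: float) -> List[int]:
        expired = sorted((ino for ino, t in dirty.items() if t <= cutoff),
                         key=location_of)
        for ino in expired:
            del dirty[ino]
        return expired
    return op


class SuperBlock:
    def __init__(self, name: str,
                 writeback_older_than: Optional[WritebackOp] = None):
        self.name = name
        self.dirty: Dict[int, float] = {}
        # None means "use the generic mechanism".
        self.writeback_older_than = writeback_older_than


def writeback_sb(sb: SuperBlock, now: float, max_age: float) -> List[int]:
    """Call the filesystem's hook if it has one, else the generic one."""
    op = sb.writeback_older_than or generic_writeback_older_than
    return op(sb.dirty, now - max_age)
```

The caller only ever says "older than N"; how the dirty inodes are
tracked and in what order they go out stays inside the hook.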

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com